Abstract

We present two new statistical machine learning methods designed to learn
on fully homomorphic encrypted (FHE) data. The introduction of FHE schemes
following Gentry (2009) opens up the prospect of privacy preserving statistical
machine learning analysis and modelling of encrypted data without compromising
security constraints. We propose tailored algorithms for applying extremely random
forests, involving a new cryptographic stochastic fraction estimator, and naive
Bayes, involving a semi-parametric model for the class decision boundary, and show
how they can be used to learn and predict from encrypted data. We demonstrate
that these techniques perform competitively on a variety of classification data sets
and provide detailed information about the computational practicalities of these
and other FHE methods.


Keywords: homomorphic encryption, data privacy, encrypted machine learning, 
1 Introduction


Privacy requirements around data can impede the uptake and application of statistical
analysis and machine learning algorithms. Traditional cryptographic methods enable safe
long-term storage of information, but when analysis is to be performed the data must first
be decrypted. Rivest et al. (1978) initially showed that it may be possible to design an
encryption scheme that supports restricted mathematical computations without decrypting.
However, it was not until Gentry (2009) that a scheme able to support theoretically
arbitrary computation was proposed. Briefly, these so-called homomorphic encryption
schemes allow certain mathematical operators such as addition and multiplication
to be performed directly on the cipher texts (encrypted data), yielding encrypted results
which upon decryption render the same results as if the operations had been performed
on the plain texts (original data). These schemes are reviewed in a companion report
to this paper (Aslett, Esperança and Holmes, 2015) in a manner which is accessible to
statisticians and machine learners, with accompanying high level open source software in
R to allow users to explore the various issues.¹

¹ In this report we will assume that the reader is familiar with the basic concepts of fully
homomorphic encryption and some of the practical computational constraints, as overviewed
in Aslett, Esperança and Holmes (2015) and Gentry (2010).
Privacy constraints enter into many areas of modern data analysis, from biobanks
and medical data to the impending wave of ‘wearable devices’ such as smart watches,
generating large amounts of personal biomedical data (Anderlik and Rothstein, 2001;
Kaufman et al., 2009; Angrist, 2013; Brenner, 2013; Ginsburg, 2014). Moreover, with
the advent of cloud computing many data owners are looking to outsource storage and
computing, but particularly with non-centralised services there may be concerns with
security issues during data analysis (Lin et al., 2011). Indeed, encryption may even
be desirable on internal network connected systems as providing an additional layer of
security.

Although homomorphic encryption in theory promises arbitrary computation, the
practical constraints mean that this is presently out of reach for many algorithms (Aslett,
Esperança and Holmes, 2015). This motivates the interest in tailored machine learning
methods which can be practically applied. This paper contributes two such methods,
with FHE approximations to extremely random forests and naive Bayes developed such
that both learning and prediction can be performed encrypted, something which is not
possible with the original versions of either technique.

We are not the first to explore secure machine learning approaches using encryption.
Graepel et al. (2012) implemented two binary classification algorithms for homomorphically
encrypted data: Linear Means and Fisher’s Linear Discriminant. They make
scaling adjustments which preserve the results, but leave the fundamental methodology
unchanged. Bost et al. (2014) developed a two party computation framework and used
a mix of different partly and fully homomorphic encryption schemes which allows them
to use machine learning techniques based on hyperplane decisions, naive Bayes and binary
decision trees. Again the fundamental methodologies are unchanged, but here
substantial communication between two (‘honest but curious’) parties is required.

These are two existing approaches to working within the constraints imposed by
homomorphic encryption: either by the use of existing methods amenable to homomorphic
computation; or by invoking multi-party methods. Here, we consider tailored approximations
to two statistical machine learning models which make them amenable to homomorphic
encryption, so that all stages of fitting and prediction can be computed encrypted.
Thus, herein we contribute two machine learning algorithms tailored to the framework
of fully homomorphic encryption and provide an R package implementing them (Aslett
and Esperança, 2015). These techniques do not require multi-party communication.


Aside from classification techniques, other privacy preserving statistical methods have
been proposed in the literature, such as small-P linear regression (P < 5; Wu and Haven,
2012) and predictive machine learning using pre-trained models (e.g., logistic regression;
Bos et al., 2014).


In Section 2 a brief recap of homomorphic encryption and consequences for data
representation is presented, with the unfamiliar reader directed to Aslett, Esperança and
Holmes (2015) for a fuller review. Section 3 contains a novel implementation of extremely
random forests (Geurts et al., 2006; Cutler and Zhao, 2001) including a stochastic
approximation to tree voting. In Section 4 a novel semi-parametric naive Bayes algorithm
is developed that utilises logistic regression to define the decision boundaries. Section
5 details empirical results of classification performance on a variety of tasks taken from
the UCI machine learning repository, as well as demonstrating the practicality with
performance metrics from fitting a completely random forest using the Amazon EC2 cloud
platform. Section 6 offers a discussion and conclusions.
2 Homomorphic encryption and data representation

We shall adopt a public key encryption scheme having public key $k_p$ and secret key
$k_s$, equipped with algorithms $\mathrm{Enc}(k_p, \cdot)$ and $\mathrm{Dec}(k_s, \cdot)$ which encrypt and decrypt
messages respectively. Encryption maps a message $m \in M$ from message space to an
element of cipher text space, $c \in C$. A scheme is then said to be homomorphic for some
operations $\circ \in \mathcal{F}_M$ acting in message space (such as addition or multiplication) if there
are corresponding operations $\diamond \in \mathcal{F}_C$ acting in cipher text space satisfying the property:

$$\mathrm{Dec}\big(k_s,\ \mathrm{Enc}(k_p, m_1) \diamond \mathrm{Enc}(k_p, m_2)\big) = m_1 \circ m_2 \quad \forall\, m_1, m_2 \in M$$

A scheme is fully homomorphic if it is homomorphic for both addition and multiplication.


We shall consider herein the particular homomorphic encryption scheme of Fan and
Vercauteren (2012), a high performance and easy to use implementation of which is available
in R (Aslett, 2014a), and assume that the reader is familiar with the basic principles of
this approach (Aslett, Esperança and Holmes, 2015).


2.1 Practical limitations 


Although FHE schemes exist, it is worth briefly recalling the practical constraints in
implementing arbitrary algorithms, as they impact and motivate the tailored developments
presented in this paper. Some of the current practical implementation issues include:

Message space: Real value encryption lies outside of existing FHE schemes, so that
measurements must typically be stored as integers. Given an integer measurement, $x$,
the choice of the corresponding message space representation, $M$, will have consequences
for computational cost and memory requirements. For example, $x$ could be directly
represented in an integer message space $M \subset \mathbb{Z}$, or in a binary message space $M = \{0,1\}$, which
involves writing down the value in base 2 ($x = \sum_i 2^i b_i$) and encrypting each bit (each $b_i$)
separately.

The major consequence is that performing simple operations such as addition and
multiplication under the binary representation involves manual binary arithmetic, which
is much more expensive than the single operation involved when the natural integer
representation is used. For instance, adding two 32-bit values in a binary representation
would involve over 256 fundamental operations by using standard full binary adder logic.
Consequently, we do not consider FHE schemes where binary representation is the only
option and instead require $M \subset \mathbb{Z}$ for the new techniques to be presented in Sections
3 and 4.

However, although $M \subset \mathbb{Z}$ is more efficient computationally, it still does not naturally
accommodate the kinds of data commonly encountered in statistics and machine learning
applications, so that even representing data requires careful consideration.

Cipher text size: existing FHE schemes result in substantial inflation in the size of the
data. For example, in Fan and Vercauteren (2012) the cipher text space is $C = \mathbb{Z}_q[x] \times
\mathbb{Z}_q[x]$, a cartesian product of high degree polynomial rings with coefficients belonging to
a large integer ring. When using the default parameter values in that paper,
the two polynomials are of degree 4,095, with each of the 8,192 coefficients being 128-bit
integers. This means that 1MB of message data can grow to approximately 16.4GB of
encrypted data, representing a roughly 16,400-fold increase in storage size.
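
The storage figures above can be checked with a short back-of-envelope calculation. The sketch below is only illustrative: the width of one plaintext value (8 bytes) is an assumption made here to reproduce the quoted 16.4GB and is not stated in the text.

```python
# Back-of-envelope cipher text size for the parameters quoted above.
# ASSUMPTION (not from the text): each plaintext value occupies 8 bytes.
COEFFS_PER_CIPHERTEXT = 8192   # two polynomials of degree 4,095
BITS_PER_COEFF = 128
BYTES_PER_PLAIN_VALUE = 8      # assumed width of one message value

ciphertext_bytes = COEFFS_PER_CIPHERTEXT * BITS_PER_COEFF // 8
values_per_mb = 10**6 // BYTES_PER_PLAIN_VALUE
encrypted_bytes = values_per_mb * ciphertext_bytes
inflation = encrypted_bytes / 10**6

print(ciphertext_bytes)         # 131072 bytes per cipher text
print(encrypted_bytes / 10**9)  # 16.384 (GB)
print(inflation)                # 16384.0
```

Under that assumption every stored value, however small, costs one full cipher text of 131,072 bytes, which is where the inflation factor comes from.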

Computational speed: due in part to the increased data size, but also due to the
complex cipher text spaces, the cost of performing operations is high. For example, in Fan
and Vercauteren (2012) arithmetic for simple messages in $M$ is achieved by performing
complex polynomial arithmetic in $C$.

To make this concrete, imagine adding the numbers 2 and 3 to produce 5. Basic
parameter choices for the Fan and Vercauteren (2012) encryption scheme will mean that
not only does this simple addition involve adding polynomials of degree 4,095, but the
128-bit integer coefficients of those polynomials are too large to be natively represented or
operated on by modern CPUs.

Indeed, the theoretical latency for integer addition on a modern CPU is 1 clock cycle,
so that 2 + 3 executes in under 1 nanosecond ($10^{-9}$ s). By contrast, the optimised C++
implementation of Fan and Vercauteren (2012) in Aslett (2014a) takes around 3 milliseconds
($10^{-3}$ s) to perform the same computation encrypted.

Division and comparison: existing integer message space schemes cannot perform
encrypted division and are unable to evaluate binary comparison operations such as
=, < and >, so that mathematical operations are currently restricted to addition and
multiplication.

Cryptographic noise: the semantic security necessary in existing schemes involves
injection of some noise into the cipher texts, which grows as operations are performed.
Typically the noise growth under multiplication can be significant, so that after a certain
depth of multiplications the cipher text must be ‘refreshed’. This refresh step is usually
computationally expensive, so that in practice the parameters of the encryption scheme
are usually chosen a priori to ensure that all necessary operations for the algorithm to be
applied can be performed without any refresh being required.

Thus, the restriction to integers, addition and multiplication, combined with a limit
on noise growth emanating from multiplication operations, means that in reality the
constraints of homomorphic encryption allow only moderate degree polynomials of integers
to be computed encrypted. Even so, the speed of evaluation will be relatively slow
compared to the unencrypted counterparts, as demonstrated in our examples in Section 5.


2.2 Data representation 

One consequence of the above is that we need to transform data to make it amenable 
to FHE analysis. We show that certain transformations will also allow for limited forms 
of computation involving comparison operations such as =, < and >. We consider two 
simple approaches below. 


2.2.1 Quantisation for real values 


Given that many current homomorphic schemes work in the space of integers (Aslett,
Esperança and Holmes, 2015), it may be necessary to make approximations when
manipulating real-valued variables. Graepel et al. (2012) proposed an approximation method
where real values are first approximated by rationals (ratios of two integers), the
denominators are then cleared by multiplying the entire dataset by a pre-specified integer, and
the results are rounded to the nearest integer.

One suggestion here is more straightforward: choose a desired level of accuracy, say $\phi$,
which represents the number of decimal places to be retained; then multiply the data by
$10^{\phi}$ and round to the nearest integer. This avoids the need for rational approximations
and the requirement for a double approximation caused by the denominator-clearing step.
More precisely, for a given precision $\phi \in \mathbb{N} \cup \{0\}$, a real value $z \in \mathbb{R}$ is approximated
by $\tilde{z} = \lfloor 10^{\phi} \cdot z \rceil$, where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer. This transformation
adequately represents real values in an integer space, in the sense that smooth relative
distances are approximately maintained.

For data sets of finite precision (the typical case in real applications), no loss of
precision is necessary if $\phi$ is selected to be equal to the accuracy (i.e., number of decimal
places) of the most accurate value in the data set. Otherwise, in the cases where
transformations are required (e.g., logarithms), precision is under the user’s control. Note
that the parameter $\phi$ regulates the accuracy of the input (data), not that of the output
(result). To the extent that the output accuracy depends on the input accuracy and also
on the complexity of the algorithm, the choice of $\phi$ should take both these factors into
consideration.

In particular, when evaluating homogeneous polynomial expressions, no intermediate
scaling is required since every term will have scaling $10^{d\phi}$, where $d$ is the degree of
the homogeneous polynomial. Where scaling is required, it will be known a priori based
on the algorithm and is not data dependent.
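
As a minimal plaintext sketch of this quantisation, the snippet below scales two decimal values to integers and evaluates a homogeneous polynomial of degree 2 on them. The function name `quantise`, the symbol `phi` for the number of retained decimal places, and the example polynomial are illustrative choices, not part of the paper's software; values are passed as decimal strings to avoid binary floating point artefacts.

```python
from fractions import Fraction

def quantise(z: str, phi: int) -> int:
    """Round 10**phi * z to the nearest integer (z given as a decimal string)."""
    return round(Fraction(z) * 10**phi)

phi = 1                          # keep one decimal place
a = quantise("1.7", phi)         # 17
b = quantise("1.9", phi)         # 19

# Homogeneous polynomial of degree d = 2: f(a, b) = a*b + a^2.
# Only integer addition and multiplication are used, as FHE requires.
encoded = a * b + a * a          # 612

# Every term carries the same scaling 10**(d * phi), removed after decryption.
decoded = Fraction(encoded, 10 ** (2 * phi))
print(float(decoded))            # 6.12
```

Because both terms have degree 2, a single known factor of $10^{2\phi}$ rescales the result, independent of the data.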


2.2.2 Quantisation to categorical or ordinal 

The approach above encodes real values by an integer representation, but this increases
substantially the number of multiplication operations involved. Instead, by transforming
continuous measurement values into categorical or ordinal ones via a quantisation
procedure, it is possible to dispense with the need to track appropriate scaling in the algorithm.
This simple solution has not, as far as we are aware, been taken in the applied
cryptography literature to date. Moreover, this quantisation procedure allows some computations
involving comparison operations (=, <, >) to now be performed as detailed below.

Let $X$ be a design matrix, with elements $x_{ij}$ recording the $j$th predictor variable for the
$i$th observation. It may be that $x_{ij} \in \mathbb{R}$ or $x_{ij}$ may be a categorical value. In both cases,
consider a partition of the support of variable $j$ to be quantised, $\mathcal{K}_j = \{K^j_1, \dots, K^j_{m_j}\}$.
That is, $x_{ij} \in \bigcup_{k=1}^{m_j} K^j_k$ and $K^j_l \cap K^j_k = \emptyset \ \forall\, j, \forall\, l \neq k$. There are at least two routes
one may take to quantisation:

1. $x_{ij}$ is encoded as an indicator $\tilde{x}_{ijk} \in \{0,1\} \ \forall\, k$, where for each continuous variable
   $j$, $\tilde{x}_{ijk} = 1 \iff x_{ij} \in K^j_k$ and $\tilde{x}_{ijl} = 0 \ \forall\, l \neq k$. For example, a natural ordinal
   choice for $\mathcal{K}_j$ is the partition induced by the quintiles of that variable.

2. If the partition also satisfies $y < z \ \forall\, y \in K^j_l, z \in K^j_k$ and $l < k, \ \forall\, j$, then
   another option is to replace the value of $x_{ij}$ by a corresponding ordinal value, so
   that $\tilde{x}_{ij} = k \iff x_{ij} \in K^j_k$ and the support becomes $\tilde{x}_{ij} \in \{1, 2, \dots, m_j\}$.


Both approaches transform continuous, categorical or ordinal values to an encoding 
which can be represented directly in the message space of homomorphic schemes. Note 
that for categorical or discrete variables in the design matrix, these procedures can be 
exact, whilst for continuous ones they may introduce an approximation. 

Thus, for example, a design matrix $X$ would map to two possible representations $X_1$
and $X_2$ corresponding to the different procedures above, with $x_{i1} \in \{0,1\}$, $x_{i2} \in \{1, 2, 3\}$
and $x_{i3} \in \mathbb{R}$:

$$X = \begin{pmatrix} 0 & 1 & 1.7 \\ 1 & 2 & 1.9 \\ 0 & 3 & 1.6 \end{pmatrix} \quad \text{with columns } (x_{i1},\ x_{i2},\ x_{i3})$$

$$\text{Method 1} \ \to\ X_1 = \left(\begin{array}{c|ccc|ccccc} 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \end{array}\right) \quad \text{with columns } (\tilde{x}_{i11} \mid \tilde{x}_{i21}\ \tilde{x}_{i22}\ \tilde{x}_{i23} \mid \tilde{x}_{i31}\ \tilde{x}_{i32}\ \tilde{x}_{i33}\ \tilde{x}_{i34}\ \tilde{x}_{i35})$$

$$\text{or, under Method 2} \ \to\ X_2 = \begin{pmatrix} 0 & 1 & 3 \\ 1 & 2 & 5 \\ 0 & 3 & 1 \end{pmatrix} \quad \text{with columns } (\tilde{x}_{i1},\ \tilde{x}_{i2},\ \tilde{x}_{i3})$$



Recall from §2.1 and Aslett, Esperança and Holmes (2015) that comparisons of equality
cannot usually be made on encrypted content. However, Method 1 can be seen to
enable encrypted indicators for simple tests of equality, since comparisons simply become
inner products:

$$\sum_{k} \tilde{x}_{i_1 jk}\, \tilde{x}_{i_2 jk} = 1 \iff \text{obs } i_1 \text{ and } i_2 \text{ have equal quantised value on variable } j \qquad (1)$$

otherwise the sum is zero. In particular, note that this is a homogeneous polynomial of
degree 2, requiring only 1 multiplication depth in the analysis.

Likewise, it is possible to evaluate an encrypted indicator for whether a value lies in
a given range, because:

$$\sum_{k \in K} \tilde{x}_{ijk} = 1 \iff \text{obs } i \text{ has quantised value in the set } K \qquad (2)$$

otherwise the sum is zero.
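
Identities (1) and (2) can be illustrated on unencrypted indicator vectors. The sketch below uses plain Python integers as stand-ins for cipher texts; the helper names are hypothetical, and the only operations applied to the encoded values are addition and multiplication, as would be required under FHE.

```python
def one_hot(k, m):
    """Method 1 encoding: indicator vector for bin k (1-based) among m bins."""
    return [1 if j == k else 0 for j in range(1, m + 1)]

def equal_indicator(u, v):
    """Identity (1): inner product of two indicator vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_set_indicator(u, bins):
    """Identity (2): sum of the indicators for the (public) bin set K, 1-based."""
    return sum(u[k - 1] for k in bins)

x1 = one_hot(3, 5)   # observation 1 falls in bin 3 of 5
x2 = one_hot(5, 5)   # observation 2 falls in bin 5
x3 = one_hot(3, 5)   # observation 3 falls in bin 3

print(equal_indicator(x1, x3))            # 1: same quantised value
print(equal_indicator(x1, x2))            # 0: different quantised value
print(in_set_indicator(x1, {1, 2, 3}))    # 1: bin 3 is in K
print(in_set_indicator(x2, {1, 2, 3}))    # 0: bin 5 is not in K
```

The encoding step itself (deciding which bin a value falls in) happens before encryption; afterwards the comparison is replaced entirely by sums and products.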

Conversely, Method 2 may be preferred in linear modelling situations, where a coefficient
would then represent the change in $y$ for an incremental change in the quantised encoding,
whereas in a linear modelling context Method 1 results in separate estimates of effect for
each category of encoding.


Note that this is not a binary representation of the kind critiqued in section 2.1:
here they are binary indicator values, with an integer representation in an integer space.
Therefore, to count the number of indicators, for example, is simple addition, as opposed
to the binary arithmetic described earlier.

In the next two sections we present the tailored statistical machine learning techniques
developed specifically with the constraints of homomorphic encryption in mind.


3 Extremely Random Forests 


Extremely or perfectly random forests (Geurts et al., 2006; Cutler and Zhao, 2001) can
exhibit competitive classification performance against their more traditional counterpart
(Breiman, 2001). Forest methods combine many decision trees in an ensemble classifier
and empirically often perform well on complex non-linear classification problems.
Traditional random forests involve extensive comparison operations and evaluation of split
quality at each level, operations which are either prohibitive or impossible to compute
homomorphically in current schemes. However, we show that a tailored version of
extremely or perfectly random forests can be computed fully encrypted, where both fitting
and prediction are possible, with all operations performed in cipher text space. Moreover,
we highlight that the completely random nature of the methods allows for incremental
learning and divide-and-conquer learning on large data, so that massive parallelism can
be employed to ameliorate the high costs of encrypted computation. In particular, this
is demonstrated in a real 1,152 core cluster example in §5.4.


3.1 Completely Random Forests (CRF) 


To begin, we assume the training data are encoded as in Method 1 (§2.2.2) so that the
comparison identities in (1) and (2) can be used. In overview, the most basic form of the
proposed algorithm then proceeds as follows:


Step 1. Predictor variables at each level in a tree are chosen uniformly (“completely”) at
random from a subset of the full predictor set. Additionally, the split points are
chosen uniformly (“completely”) at random from a set of potential split points.
Identity (2) then provides an indicator variable for which branch a variable lies in,
so that a product of such indicators provides an indicator for a full branch of a
decision tree. Then (1) enables the pseudo-comparison involved in counting how
many observations of a given class are in each leaf of the tree.

Step 2. Step 1 is repeated for each tree in the forest independently, using a random subset
of predictors per tree, so that many such trees are grown. Each observation casts
one vote per tree, according to the terminal leaf and class to which it belongs. Note
that Step 2 can be performed in parallel as the trees are grown independently of
one another.

Step 3. At prediction the same identities as in Step 2 can be used to create an indicator
which picks out the appropriate vote from each tree, for each class.

The detailed algorithm is given in Appendix 

This algorithm is referred to as a ‘completely random forest’ since it takes the random
growth of trees to a logical extreme with tree construction performed completely blindfold
from the data. This is in contrast to, for example, extremely random forests (Geurts et al.,
2006) where optimisation takes place over a random selection of splits and variables, and
tree growth can terminate upon observing node purity or underpopulation. It is also
different to perfectly random tree ensembles (Cutler and Zhao, 2001), where random
split points are constructed between observations known to belong to different classes.
Neither of those approaches can be directly implemented within the constraints of fully
homomorphic encryption.

The model returns an encrypted prediction,

$$\hat{y}^\star \leftarrow \mathrm{CRF}(X, y, x^\star)$$

as a count of the votes² for each class category, $c$, held in cipher text space, $\hat{y}^\star_c \in C$, with
encrypted training data $\{X, y\}$ and encrypted test prediction point $x^\star$. The user decrypts
using the private encryption key, $y^\star = \mathrm{Dec}(k_s, \hat{y}^\star)$, and forms a predictive empirical
‘probability’ as:

$$p^\star_c = \frac{y^\star_c}{\sum_{c'=1}^{|C|} y^\star_{c'}} \qquad (3)$$

² The vote is the total number of training samples of category $c$ lying in the same terminal
leaf node as prediction point $x^\star$ across the trees.
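
To make the flow of Steps 1 to 3 concrete, here is a small plaintext sketch of a completely random forest that touches the encoded data only through sums and products, followed by the normalisation of votes after decryption. It is a simplified illustration under stated assumptions and not the algorithm of the appendix or the R package: one (variable, split) pair is drawn per tree level, trees have a fixed depth, plain integers stand in for cipher texts, and all function names are hypothetical.

```python
import itertools
import random

def one_hot(k, m):
    """Method 1 encoding: indicator vector for 0-based bin k of m bins."""
    return [1 if j == k else 0 for j in range(m)]

def branch_indicator(x_j, split, go_left):
    """Identity (2): sum the indicators of the bins on one side of the split."""
    bins = range(0, split) if go_left else range(split, len(x_j))
    return sum(x_j[k] for k in bins)

def leaf_indicator(x, tree, path):
    """Product of branch indicators along one root-to-leaf path."""
    ind = 1
    for (j, split), go_left in zip(tree, path):
        ind *= branch_indicator(x[j], split, go_left)
    return ind

def grow_tree(n_vars, n_bins, depth, rng):
    """Draw splits blind to the data; ASSUMED: one (variable, split) per level."""
    return [(rng.randrange(n_vars), rng.randrange(1, n_bins)) for _ in range(depth)]

def fit_counts(X, Y, tree):
    """counts[path][c]: number of training points of class c in each leaf."""
    paths = itertools.product([True, False], repeat=len(tree))
    return {p: [sum(Y[i][c] * leaf_indicator(X[i], tree, p) for i in range(len(X)))
                for c in range(len(Y[0]))]
            for p in paths}

def predict_votes(x_star, forest, n_classes):
    """Sum, over trees, the class counts of the leaf that x_star falls in."""
    votes = [0] * n_classes
    for tree, counts in forest:
        for path, leaf_counts in counts.items():
            ind = leaf_indicator(x_star, tree, path)
            for c in range(n_classes):
                votes[c] += ind * leaf_counts[c]
    return votes

rng = random.Random(0)
X = [[one_hot(rng.randrange(4), 4) for _ in range(3)] for _ in range(20)]
Y = [one_hot(rng.randrange(2), 2) for _ in range(20)]
forest = []
for _ in range(10):
    tree = grow_tree(n_vars=3, n_bins=4, depth=2, rng=rng)
    forest.append((tree, fit_counts(X, Y, tree)))

votes = predict_votes(X[0], forest, n_classes=2)  # would remain encrypted
total = sum(votes)                                # division only after decryption
probs = [v / total for v in votes]                # normalisation as in (3)
print(votes, probs)
```

Note that no comparison is ever evaluated on the encoded data: which leaf an observation falls in is never revealed, since every leaf contributes to the sum and all but one contribution is multiplied by an indicator equal to zero.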


3.2 Cryptographic stochastic fraction estimate 


In conventional forest algorithms each tree gets a single prediction ‘vote’ regardless of 
the number of training samples that were present in a leaf node for prediction. This is in 
contrast to the above, where due to encryption constraints the algorithm simply counts 
the total number of training samples from each category falling in the leaf node of the 
prediction point, summed across all trees. The difficulty in matching to convention is 
that converting the number of training samples in a category to the vote of the most 
probable category for each tree is not possible under current FHE schemes, and would 
need to be done through decryption. 

To address this we propose a method of making an asymptotically consistent stochastic 
approximation to enable voting from each tree. This is done by exploiting the fact that 
the adjustment required can be approximated via an appropriate encrypted Bernoulli 
process by sampling with replacement. This stochastic adjustment can be computed 
entirely encrypted. 

There are several approaches to estimating class probabilities from an ensemble of
trees. Perhaps most common is the average vote forest, which is not possible because
comparisons between class votes to establish the maximum vote in a leaf are not possible.
An alternative is the relative class frequencies approach, which appears also to be beyond
reach encrypted because of the need to perform division and representation of values in
(0,1). An obvious solution to the representation issue, as already discussed in §2.2, is to
say:

$$\text{votes for class } c \text{ in leaf } b \text{ of tree } t = \left\lfloor \frac{N \rho^t_{bc}}{\sum_c \rho^t_{bc}} \right\rceil \approx \rho^t_{bc} \left\lfloor \frac{N}{\sum_c \rho^t_{bc}} \right\rceil$$

where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer, $N$ is the number of training
observations, and $\rho^t_{bc}$ counts the number of training samples of class $c$ lying in the $b$th terminal leaf
node of the $t$th tree (see the Appendix for details), giving a scale $\{0, \dots, N\}$ which can be
represented encrypted, albeit seemingly still not computed due to the division.

Note that $\sum_c \rho^t_{bc} \le N$, so that the reciprocal of the second term above lies in (0,1)
and can be treated as a probability, and recall that $X \sim \mathrm{Geometric}(p) \implies \mathbb{E}[X] = p^{-1}$.
In other words, one can view an unbiased stochastic approximation to the fraction we
require to be a draw from a Geometric distribution with probability $\sum_c \rho^t_{bc} / N$.

This transforms the problem from performing division to performing encrypted random
number generation, where the distribution parameter involves division, which initially
may seem worse. However, observe that each $\rho^t_{bc}$ term arises from summing a binary
vector from $\{0,1\}^N$:

$$\rho^t_{bc} = \sum_{i=1}^{N} \rho^t_{ibc} \quad \text{where} \quad \rho^t_{ibc} := y_{ic}\, \mathbb{I}(x_i \in b)$$

Consequently, exchanging the order of summation, $\sum_c \rho^t_{bc}$ can be treated as a sum of a
binary vector of length $N$:

$$\sum_{c=1}^{|C|} \rho^t_{bc} = \sum_{i=1}^{N} \eta^t_{ib} \quad \text{where} \quad \eta^t_{ib} := \sum_{c=1}^{|C|} \rho^t_{ibc}$$

where $|C|$ is the number of classes. In other words, the length $N$ vector $\eta^t_b$ (for each
$t$, $b$) is an encrypted sequence of 0’s and 1’s with precisely the correct number of 1’s such
that blind random sampling with replacement from the elements produces an (encrypted)
Bernoulli process with success probability $\sum_c \rho^t_{bc} / N$. Hereinafter, refer to this as simply $\eta_i$,
the dependence on tree and leaf being implicit.

Thus, the objective is to sample a Geometric random variable encrypted, but at
this stage it is only possible to generate the encrypted Bernoulli process underlying the
desired Geometric distribution. This finally shifts the problem to that of counting the
number of leading zeros in an encrypted Bernoulli process: in other words, resample with
replacement from $\eta_1, \dots, \eta_N$ a vector of length $M$, say, and without decrypting establish
the number of leading zeros.

To achieve this it is possible to draw on an algorithm used in CPU hardware to
determine the number of leading zeros in an IEEE floating point number, an operation
required when renormalising the mantissa (the coefficient in scientific notation). Let
$\eta_1, \dots, \eta_M$ be a resampled vector and assume $M$ is a power of 2 (the reason being that
this maximises the estimation accuracy for a fixed number of multiplications):

1. For $l \in \{0, \dots, \log_2(M) - 1\}$:

   • Set $\eta_i = \eta_i \vee \eta_{i-2^l} = \eta_i + \eta_{i-2^l} - \eta_i \eta_{i-2^l} \quad \forall\ 2^l + 1 \le i \le M$

2. The number of leading zeros is $M - \sum_{i=1}^{M} \eta_i$

In summary, this corresponds to bit-shifts by increasing powers of 2, each of which is
OR’d with the vector itself, all of which can be computed encrypted.

Thus, an approximately unbiased (for large enough $M$) encrypted estimator of the
desired fraction, $N / \sum_c \rho^t_{bc}$, is $M - \sum_{i=1}^{M} \eta_i + 1$ upon termination of the above algorithm.
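
The leading-zero count and the resulting Geometric draw can be sketched on unencrypted 0/1 integers as follows. This is an illustrative simulation only: plain integers stand in for cipher texts, each round is computed from the previous round's vector (a simultaneous update, which is an assumption about how the shift-and-OR step is applied), and the function names are hypothetical.

```python
import random

def prefix_or(eta):
    """Shift-and-OR by increasing powers of 2, using only +, - and *."""
    eta = list(eta)
    m = len(eta)                 # assumed to be a power of 2
    shift = 1
    while shift < m:
        prev = list(eta)         # every position is updated from the previous round
        for i in range(shift, m):
            eta[i] = prev[i] + prev[i - shift] - prev[i] * prev[i - shift]
        shift *= 2
    return eta

def leading_zeros(eta):
    """M minus the sum of the prefix-OR'd vector."""
    return len(eta) - sum(prefix_or(eta))

def fraction_estimate(eta_train, m, rng):
    """Resample m elements with replacement; leading zeros + 1 is the draw."""
    sample = [eta_train[rng.randrange(len(eta_train))] for _ in range(m)]
    return leading_zeros(sample) + 1

print(leading_zeros([0, 0, 1, 0, 1, 0, 0, 0]))   # 2

# A leaf holding 250 of N = 1000 training points: the target fraction is 4.
rng = random.Random(1)
eta_train = [1] * 250 + [0] * 750
draws = [fraction_estimate(eta_train, 32, rng) for _ in range(2000)]
print(sum(draws) / len(draws))   # sample mean, to be compared with N / 250 = 4
```

In the encrypted setting the resampling indices are chosen blind, so the party doing the computation learns neither the sample nor the resulting count.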

It is important to note that the multiplicative depth required for this algorithm is M,
and recall that multiplicative depth is restricted under current FHE schemes (Aslett,
Esperança and Holmes, 2015) if expensive cipher text refreshing is to be avoided. Hence,
in practice the resample size will typically be restricted to a small value like $M = 32$
even for large $N$ datasets. However, this is desirable: it enables some shrinkage to take
place by placing an upper bound of $M$ on the fraction estimate. Thus, for a choice of
$M = 32$, terminal leaves will in expectation have the correct adjustment if at least $\frac{1}{32}$
of the training data are in that decision path of the tree. For example, in a training
data set of size $N = 1000$, the stochastic fraction is correct in expectation for all leaves
containing at least $1000/32 \approx 31$ observations; with fewer observations the leaf votes will
undergo shrinkage in expectation.

Note in particular that CRFs are inherently discrete and the probability of regrowing
exactly the same tree twice is not zero, so that asymptotically the same tree
will be regrown infinitely often with probability 1. If the encrypted stochastic fraction is
recomputed in each new tree then asymptotically the correct adjustment will be made.
3.3 Further implementation issues

We highlight a couple of additional implementation issues that are important for the 
practical machine learning of completely random forests on FHE data. 


3.3.1 Calibration 

The first point to note is that there is no calibration of the trees, or indeed the forest.
Consequently there should be no presumption that $\arg\max_c \{p^\star_c : c = 1, \dots, |C|\}$
provides the “best” prediction for class $c$ under unequal misclassification loss. As such, the
traditional training and testing setup is crucially important in order to select optimal
decision boundaries according to whatever criteria are relevant to the subject matter of
the problem, such as false positive and negative rates. This is the only step which must
be performed unencrypted: the responses of the test set must be visible, though note
that the predictors need not be, since Step 3 for prediction is computed homomorphically.


3.3.2 Incremental and parallel computation

One key advantage of CRFs is that learning is incremental as new data become available:
there is no need to recompute the entire fit as there is no optimisation step, so that, once
used, encrypted data can be archived at lower cost; moreover, adding new observations
has linear growth in computational cost.

Indeed the whole algorithm is embarrassingly parallel both in the number of trees and
the number of observations. One can independently compute the trees, and data can be
split into shards whereby $\rho^t_{bc}$ is computed for each shard separately, using the same seed
in the random number generator for growing trees, and then simply additively combined
afterwards (a comparatively cheap operation). This is highlighted in a real-world large
scale example in §5.4.
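
The additive combination across shards can be illustrated with a small plaintext sketch. It is deliberately simplified and not the paper's implementation: predictors are ordinal bin indices, leaf membership is found with ordinary comparisons (which would be replaced by indicator arithmetic when encrypted), and the function names and the shared `seed` argument are hypothetical.

```python
import random

def grow_tree(seed, n_vars, n_bins, depth):
    """Same seed on every shard gives the same blind random tree."""
    rng = random.Random(seed)
    return [(rng.randrange(n_vars), rng.randrange(1, n_bins)) for _ in range(depth)]

def leaf_counts(X, y, tree, n_classes):
    """counts[leaf][c], where leaf is the tuple of left/right decisions."""
    counts = {}
    for xi, yi in zip(X, y):
        leaf = tuple(xi[j] < s for j, s in tree)
        counts.setdefault(leaf, [0] * n_classes)[yi] += 1
    return counts

def combine(a, b):
    """Additively merge the per-leaf class counts from two shards."""
    out = {leaf: list(c) for leaf, c in a.items()}
    for leaf, c in b.items():
        cur = out.setdefault(leaf, [0] * len(c))
        for k, v in enumerate(c):
            cur[k] += v
    return out

data_rng = random.Random(42)
X = [[data_rng.randrange(5) for _ in range(3)] for _ in range(100)]
y = [data_rng.randrange(2) for _ in range(100)]

tree = grow_tree(seed=123, n_vars=3, n_bins=5, depth=3)
full = leaf_counts(X, y, tree, 2)
shard_a = leaf_counts(X[:60], y[:60], grow_tree(123, 3, 5, 3), 2)
shard_b = leaf_counts(X[60:], y[60:], grow_tree(123, 3, 5, 3), 2)

print(combine(shard_a, shard_b) == full)   # True
```

The same additivity is what makes incremental learning cheap: a new batch of observations is just another shard whose counts are added to the existing ones.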


3.3.3 Theoretical parameter requirements for Fan and Vercauteren (2012)

For a discussion of practical requirements for the parameter selection in homomorphic
encryption schemes, see Appendix B.


The CRF fitting, prediction and forest combination are all implemented in the open
source R package EncryptedStats (Aslett and Esperança, 2015) and can be run on
unencrypted data as well as data encrypted using the HomomorphicEncryption (Aslett,
2014a) package. These are briefly described in the Appendix.


The next section introduces the second novel method tailored for FHE. 


4 Naive Bayes Classifiers 


The Naive Bayes (NB) classifier is a popular generative classification algorithm that mod¬ 
els the joint probability of predictor variables independently for each response class, and 
then uses Bayes rule and an independence assumption among predictors to construct 
a simple classifier (Ng and Jordan, 2002 Hastie et al, 2009, p.210). The advantages 


and disadvantages have been extensively described in the literature, for example (Rennie 


et al.. 2003). Although the independence assumption underlying NB is often violated. 


the linear growth in complexity for large number of predictors and the simple closed-form 






















expressions for the decision rules make the approach attractive in “big-data” situations. 
Moreover, as highlighted by Domingos and Pazzani (1997), there is an important distinction 
between classification accuracy (predicting the correct class) and accurately estimating 
the class probability, and hence NB can perform well in classification error rate even 
when the independence assumption is violated by a wide margin. Essentially, although 
it produces biased probability estimates, this does not necessarily translate into a high 
classification error (Hand and Yu, 2001). Consequently, NB remains a well established 
and popular method. 


The Naive Bayes framework 

Consider the binary classification problem with a set of $P$ predictors 
$x = \{x_j\}_{j=1}^{P}$ and let $y \in \{0, 1\}$. The NB classifier uses Bayes theorem, 

$$P(y \mid x) \propto P(x \mid y)\, P(y), \qquad (4)$$

for prediction coupled with an independence assumption, 

$$P(x \mid y) = \prod_{j=1}^{P} P(x_j \mid y), \qquad (5)$$

which embodies a compromise between accuracy and tractability, to obtain a conditional 
class probability. This allows NB to separately model the conditional distributions of 
predictor variables, $\{P(x \mid y = 1), P(x \mid y = 0)\}$, and then construct the 
prediction probability via Bayes theorem, 

$$P(y = 1 \mid x) = \frac{P(x \mid y = 1)\, P(y = 1)}{P(x \mid y = 1)\, P(y = 1) + P(x \mid y = 0)\, P(y = 0)}. \qquad (6)$$


The most popular forms for the distributions $P(x_j \mid y)$ are multinomial for categorical 
predictors and Gaussian for continuous predictors. As shown in the Appendix it is 
possible to work with multinomial distributions directly in cipher text space, albeit at 
a multiplicative depth of $3P - 1$, but the Gaussian distributions lie outside of FHE. In 
the next subsection we propose a tailored semi-parametric NB model that is amenable 
to cipher text computation. Crucially, this novel method scales to an arbitrary number 
of predictors at no additional multiplicative depth, making it well suited to encrypted 
computation. 


4.1 Semi-parametric Naive Bayes 


The NB classifier solves the classification task using a generative approach, i.e., by 
modelling the distribution of the predictors (Ng and Jordan, 2002). However, distributions 
such as the Gaussian cannot be directly implemented within an FHE scheme as they involve 
division and exponentiation operators as well as continuous values. Here, we show that 
it is possible to model the decision boundary between the two response classes more 
explicitly, without a parametric model for the distributions of the predictors, while 
still remaining in the NB framework. As will become clear, this corresponds to a 
discriminative approach to classification where the decision boundary of the conditional 
class probabilities is modelled semi-parametrically. 
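To begin, note that the expression for the log-odds prediction from NB can be 
rearranged to give 

$$\log\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right) = (P - 1) \log\left(\frac{P(y = 0)}{P(y = 1)}\right) + \sum_{j=1}^{P} \log\left(\frac{P(y = 1 \mid x_j)}{P(y = 0 \mid x_j)}\right). \qquad (7)$$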
Using this identity, we now propose to model the decision boundary $P(y \mid x_j)$ directly 
using the linear logistic form, rather than parameterise $P(x_j \mid y)$ via a distribution 
function. That is, we assume 

$$\frac{P(y = 1 \mid x_j)}{P(y = 0 \mid x_j)} = e^{f(x_j; \alpha_j, \beta_j)} \quad \text{and} \quad \frac{P(y = 0)}{P(y = 1)} = e^{\theta}, \qquad (8)$$


where, in this work, $f(x_j; \alpha_j, \beta_j)$ is taken to be a linear predictor of the 
form $\alpha_j + \beta_j x_j$. The independence structure means that each term can be 
optimised independently by way of an approximation to logistic regression, amenable to 
homomorphic computation (presented below; §4.2), since the standard iteratively reweighted 
least squares fitting procedure is not computable under the restrictions of homomorphic 
encryption. Optimisation of $\theta$ is done independently of $\{\alpha_j, \beta_j\}$, 


$$\hat\theta = \log\left(\frac{N - \sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} y_i}\right). \qquad (9)$$


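The estimated log-odds are then 

$$\hat\eta(x; \hat\theta, \hat\alpha, \hat\beta) = \log\left(\frac{P(y = 1 \mid x; \hat\theta, \hat\alpha, \hat\beta)}{P(y = 0 \mid x; \hat\theta, \hat\alpha, \hat\beta)}\right) = (P - 1)\,\hat\theta + \sum_{j=1}^{P} f(x_j; \hat\alpha_j, \hat\beta_j) \qquad (10)$$

or, equivalently, in terms of conditional class probabilities 

$$P(y = 1 \mid x; \hat\theta, \hat\alpha, \hat\beta) = \frac{1}{1 + \exp\{-\hat\eta(x; \hat\theta, \hat\alpha, \hat\beta)\}}. \qquad (11)$$

Hence equation (11) can be computed after decryption from the factors 
(numerator/denominator) that comprise $\{\hat\theta, \hat\alpha, \hat\beta\}$ to form the 
conditional class predictions. 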
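
The post-decryption step can be sketched as follows. This is only an illustrative sketch, not the EncryptedStats implementation: the function name `snb_probability` and its arguments are hypothetical, and it assumes the client already holds the decrypted numerator and denominator of the prior odds $P(y=0)/P(y=1)$ together with the per-predictor estimates, combined through the rearranged NB log-odds $(P-1)\theta + \sum_j (\alpha_j + \beta_j x_j)$. 

```python
import math


def snb_probability(x, theta_num, theta_den, alpha, beta):
    """Plaintext P(y = 1 | x) from decrypted SNB factors (illustrative only).

    theta_num / theta_den is assumed to be the decrypted ratio
    (number of y = 0) / (number of y = 1), so theta is its logarithm.
    alpha[j], beta[j] are the intercept and slope for predictor j.
    """
    n_predictors = len(x)
    theta = math.log(theta_num / theta_den)
    # Rearranged NB log-odds: (P - 1) * theta + sum_j (alpha_j + beta_j * x_j)
    eta = (n_predictors - 1) * theta + sum(
        a + b * xj for a, b, xj in zip(alpha, beta, x)
    )
    return 1.0 / (1.0 + math.exp(-eta))
```

Only the final division, logarithm and exponential are done in the clear; everything that feeds them is a sum or product and could be accumulated on cipher texts. 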

This defines a semi-parametric Naive Bayes (SNB) classification model that assumes a 
logistic form for the decision boundary in each predictor variable, and where each logistic 
regression involves at most two parameters. As we will show in the next section, in this 
setting there is a suitable approximation to the maximum likelihood estimates, which are 
traditionally computed using the iteratively reweighted least squares algorithm. 
4.2 Logistic Regression 

The SNB algorithm proposed in the previous section requires a homomorphic implementation 
of simple logistic regression involving an intercept and a single slope parameter. 
In this section an approximation to logistic regression based on the first iteration of the 
Iteratively Reweighted Least Squares (IRLS) algorithm is proposed and some of its 
theoretical properties are analysed. Apart from its use in SNB, the approximation to logistic 
regression described in this section also stands on its own as a classification method 
amenable to homomorphic computation. 

4.2.1 First step of Iteratively Reweighted Least Squares 

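Optimisation in logistic regression is typically achieved via IRLS based on 
Newton-Raphson iterations. Starting from an initial guess $\beta^{(0)}$, the first step 
involves updating the auxiliary variables 

$$\eta_i^{(0)} = x_{i\cdot}\,\beta^{(0)}, \quad \mu_i^{(0)} = \frac{\exp(\eta_i^{(0)})}{1 + \exp(\eta_i^{(0)})}, \quad w_{ii}^{(0)} = \mu_i^{(0)}\,(1 - \mu_i^{(0)}), \quad z_i^{(0)} = \eta_i^{(0)} + \frac{y_i - \mu_i^{(0)}}{w_{ii}^{(0)}}, \qquad (12)$$

where $x_{i\cdot}$ denotes the $i$th row of $X$. By starting with an initial value 
$\beta^{(0)} = (0, \dots, 0)^T$ the initialisation step of the algorithm is simplified: for 
all $i \in \mathbb{N}_{1:N}$ we have $\eta_i^{(0)} = 0$, $\mu_i^{(0)} = 1/2$, 
$w_{ii}^{(0)} = 1/4$ and $z_i^{(0)} = 4y_i - 2$. In the second step, the parameter estimates 
of $\beta$ are updated using generalised least squares 

$$\beta^{(1)} = (X^T W^{(0)} X)^{-1} X^T W^{(0)} z^{(0)}, \qquad (13)$$

where $W^{(0)}$ is the diagonal matrix of weights $w_{ii}^{(0)}$, and these two steps are 
repeated until convergence is achieved. 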

In what follows, only the first-iteration update of $\beta$ is considered, because it 
provides an approximation to logistic regression and, most importantly, one which is 
computationally feasible under homomorphic encryption; this will be termed the one-step 
approximation. Note that implementation of the full IRLS algorithm is infeasible under 
FHE because the weights $w_{ii}$ cannot be updated, as this requires the evaluation of 
non-polynomial functions. 

The particular form of these (paired) one-step approximating equations for an 
intercept and a single slope parameter is shown in the following section. We show an 
additional simplification (so-called unpaired estimates) which improves encrypted 
computational characteristics further at the cost of greater approximation. 

4.2.2 Paired, one-step approximation to simple logistic regression 

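In the SNB model the conditional log-odds are optimised for each variable separately, 
following the independence assumption. In this case the design matrix $X$ contains a 
single column of 1's for the intercept and a single predictor variable column, and the first 
step of the IRLS leads to 

$$\{\hat\alpha_j, \hat\beta_j\} = \{a_j / d_j,\; b_j / d_j\}, \quad \forall\, j \in \mathbb{N}_{1:P}, \qquad (14)$$

where, writing $z_i^{(0)} = 4y_i - 2$, 

$$a_j = \sum_{i=1}^{N} x_{ij}^2 \sum_{i=1}^{N} z_i^{(0)} - \sum_{i=1}^{N} x_{ij} \sum_{i=1}^{N} x_{ij} z_i^{(0)}, \qquad (15)$$

$$b_j = N \sum_{i=1}^{N} x_{ij} z_i^{(0)} - \sum_{i=1}^{N} x_{ij} \sum_{i=1}^{N} z_i^{(0)}, \qquad (16)$$

$$d_j = N \sum_{i=1}^{N} x_{ij}^2 - \Big(\sum_{i=1}^{N} x_{ij}\Big)^2. \qquad (17)$$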


To use the one-step approximation in combination with the SNB classifier detailed above 
(§4.1), the quantity $\hat\alpha_j + \hat\beta_j x_j^\star$, required for the classification 
of a new observation $x^\star = \{x_j^\star\}_{j \in \mathbb{N}_{1:P}}$, can then be 
estimated as $e_j / d_j$, where $e_j = a_j + b_j x_j^\star$. 
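
As a plaintext sketch of the paired one-step estimate, the snippet below keeps numerators and denominator separate so that no division is needed until after decryption. The closed forms used for `a`, `b` and `d` are an assumption: they are the ordinary least squares solution for an intercept and one slope regressed on the working response $z_i = 4y_i - 2$, which is what the first IRLS step reduces to when started from zero. The function names are hypothetical and are not taken from the EncryptedStats package. 

```python
from fractions import Fraction


def paired_one_step(x, y):
    """Return integers (a, b, d) with alpha = a / d and beta = b / d.

    Only additions and multiplications are used, so the same arithmetic
    could in principle be carried out on cipher texts.
    """
    n = len(x)
    z = [4 * yi - 2 for yi in y]  # working response when beta^(0) = 0
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    sz = sum(z)
    sxz = sum(xi * zi for xi, zi in zip(x, z))
    a = sxx * sz - sx * sxz
    b = n * sxz - sx * sz
    d = n * sxx - sx * sx
    return a, b, d


def linear_predictor(a, b, d, x_new):
    """alpha + beta * x_new, formed as e / d with e = a + b * x_new."""
    return Fraction(a + b * x_new, d)
```

The division by `d` is deferred to `linear_predictor`, mirroring the idea that the client divides only after decrypting the numerator and denominator. 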


This corresponds to a standard, paired optimisation strategy, that is, to optimise 
intercept and slope jointly, using an approximation targeting 

$$\arg\sup_{\alpha_j, \beta_j} f(x_j; \alpha_j, \beta_j), \quad \forall\, j \in \mathbb{N}_{1:P}. \qquad (18)$$

An alternative to this approach would be to optimise intercept and slope independently 
(i.e., in an unpaired fashion), that is, targeting 

$$\arg\sup_{\alpha_j} f(x_j; \alpha_j, \beta_j = 0) \quad \text{and} \quad \arg\sup_{\beta_j} f(x_j; \alpha_j = 0, \beta_j), \quad \forall\, j \in \mathbb{N}_{1:P}, \qquad (19)$$

in which case all $\alpha_j$ take the same value and, therefore, this is equivalent to 
estimating all $\beta_j$ independently and including a global intercept, $P\alpha_j$ for 
some $j$. 

We will see that this is computationally appealing due to the simplified estimating 
equations when using the same one-step approximation. 


4.2.3 Unpaired, one-step approximation to simple logistic regression 

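In the case of unpaired estimation (or in the absence of an intercept term) the estimating 
equations for $\alpha_j$ and $\beta_j$ are simpler. To distinguish between unpaired and 
paired estimates we denote the unpaired by $\tilde\alpha_j$ and $\tilde\beta_j$, 
respectively. All $\tilde\alpha_j$ have a common form 

$$\tilde\alpha_j = \frac{1}{N} \sum_{i=1}^{N} z_i^{(0)}, \qquad (20)$$

while $\tilde\beta_j$ have the form 

$$\tilde\beta_j = \frac{\sum_{i=1}^{N} x_{ij}\, z_i^{(0)}}{\sum_{i=1}^{N} x_{ij}^2}. \qquad (21)$$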


Note that this unpaired formulation also arises in the case of centred predictors, 
$E[x_j] = 0$, so that the unpaired approach is completely equivalent to the paired approach 
and introduces no additional approximation if it is possible for the data to be centred 
prior to encryption. Note that this is trivially achievable for ordinal quintile data by 
representing each quintile by $\{-2, -1, 0, 1, 2\}$, rather than $\{1, 2, 3, 4, 5\}$. 
Consequently, where it is possible to centre the data it should be done in order to take 
advantage of these computational benefits at no approximation cost. 

Figure 1: Shrinkage in $\beta$. Because the estimator $\hat\beta$ is bounded between $-2$ 
and $2$, the shrunk value is always equal to 2 for absolute values of $\beta$ greater than 2.5. 
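
The effect of centring can be checked on a toy example. The sketch below is illustrative only: the data are made up, the function names are hypothetical, and both slopes are computed from the one-step working response $z_i = 4y_i - 2$ under the assumption that the paired estimate is the least squares slope with an intercept and the unpaired estimate is the slope through the origin. 

```python
from fractions import Fraction


def _sums(x, y):
    z = [4 * yi - 2 for yi in y]  # one-step working response
    return (len(x), sum(x), sum(z),
            sum(xi * xi for xi in x),
            sum(xi * zi for xi, zi in zip(x, z)))


def paired_slope(x, y):
    """One-step slope when an intercept is fitted jointly."""
    n, sx, sz, sxx, sxz = _sums(x, y)
    return Fraction(n * sxz - sx * sz, n * sxx - sx * sx)


def unpaired_slope(x, y):
    """One-step slope fitted on its own (no intercept term)."""
    _, _, _, sxx, sxz = _sums(x, y)
    return Fraction(sxz, sxx)


# Toy quintile data: two observations per quintile (made-up responses).
quintiles = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
responses = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
centred = [q - 3 for q in quintiles]  # recode {1,...,5} as {-2,...,2}
```

With the centred coding the two slopes coincide, whereas on the raw $\{1, \dots, 5\}$ coding the unpaired slope differs from the paired one. 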


Furthermore, when the data are represented as in section 2.2.2 this can be rewritten 
as follows. Define the following auxiliary variables 

$$n_{jk}^{[1]} = \sum_{i=1}^{N} y_i\, x_{ijk}; \quad n_{jk} = \sum_{i=1}^{N} x_{ijk}; \quad n_{jk}^{[0]} = n_{jk} - n_{jk}^{[1]}. \qquad (22)$$

In words, $n_{jk}^{[c]}$ counts the number of observations in the $k$th bin of the $j$th 
predictor (or, equivalently, the number of elements of $X_{\cdot j}$, the $j$th column of 
$X$, which are equal to $k$) for which the corresponding response is equal to $c$, for 
$c \in \{0, 1\}$. In this case, 


Equation (21) becomes 

$$\tilde\beta_j = 2 \left( \sum_{k=1}^{M_j} k^2 \sum_{i=1}^{N} x_{ijk} \right)^{-1} \sum_{k=1}^{M_j} k \left( n_{jk}^{[1]} - n_{jk}^{[0]} \right), \qquad (23)$$


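so that for a binary predictor $x_{ij} \in \{0, 1\}$ we find 

$$\tilde\beta_j = \frac{2 \left( n_{j1}^{[1]} - n_{j1}^{[0]} \right)}{n} = \frac{4\, n_{j1}^{[1]}}{n} - 2, \qquad (24)$$

where $n = \sum_{i=1}^{N} x_{ij}$. This simple expression aids in the derivation of some 
theoretical results regarding the bias of the estimator. 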
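
The agreement between the direct and the binned form of the unpaired slope can be checked numerically. This is a sketch under stated assumptions: the bin label $k$ is taken to be the predictor value itself, the working response is $z_i = 4y_i - 2$, and the function names are hypothetical. 

```python
from fractions import Fraction


def unpaired_beta(x, y):
    """Direct form: sum_i x_i * z_i / sum_i x_i^2 with z_i = 4 * y_i - 2."""
    num = sum(xi * (4 * yi - 2) for xi, yi in zip(x, y))
    den = sum(xi * xi for xi in x)
    return Fraction(num, den)


def unpaired_beta_binned(x, y, bins):
    """Binned form built from the per-bin counts n1[k] (y = 1) and n0[k] (y = 0)."""
    n1 = {k: sum(1 for xi, yi in zip(x, y) if xi == k and yi == 1) for k in bins}
    n0 = {k: sum(1 for xi, yi in zip(x, y) if xi == k and yi == 0) for k in bins}
    num = 2 * sum(k * (n1[k] - n0[k]) for k in bins)
    den = sum(k * k * (n1[k] + n0[k]) for k in bins)
    return Fraction(num, den)
```

For a binary predictor only the bin $k = 1$ contributes, giving $4 n^{[1]} / n - 2$, which always lies between $-2$ and $2$. 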


Shrinkage of $\beta$ 

Let $\beta$ denote the true parameter and $\hat\beta$ denote the one-step IRLS estimate 
from Equation (24). The one-step “early stopping” shrinks or “regularises” the estimate 
towards the origin. The shrinkage, as a function of the true parameter $\beta$, is shown in 
Figure 1: it is negligible when $\beta$ is small, but increases linearly with it (in 
magnitude) for $\beta$ outside the interval $(-2, 2)$. The reason for this is clear from the 
formula in Equation (24): the range of the one-step estimate $\hat\beta$ is $(-2, 2)$. 

Figure 2: (Left) For all $p$, the generalisation error shrinks monotonically with increased 
$n$ and converges to an asymptotic curve (in green); (Right) The rate at which the 
generalisation error curves approach the asymptotic curve is decreasing with $n$ for all 
values of $p$ (several shown here) and stabilises at around $n = 20$. 

In particular, note that this shrinkage is a highly desirable property in light of the 
independence assumptions made in SNB. Indeed, the one-step procedure empirically 
outperforms the full convergence IRLS when predictors are highly correlated and moreover 
does not significantly underperform otherwise (tests were performed on all datasets to be 
presented in Section 5). Therefore the one-step method is not only a computational 
necessity for computing encrypted, but also offers potential improvements in performance 
against the standard algorithm. 


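Generalisation error 

Define $p = P[y = 1 \mid x = 1, \beta]$ and $\hat p = P[y = 1 \mid x = 1, \hat\beta]$. Then, 
the generalisation error for $x = 1$ can be written as 

$$E[p - \hat p] = E\left[ p - \frac{1}{1 + \exp(-\hat\beta)} \right]$$

and approximated by the polynomial 

$$E[p - \hat p] \approx E\left[ p - \frac{1}{2} - \frac{\hat\beta}{4} + \frac{\hat\beta^3}{48} \right].$$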
Using the moment generating function for the Binomial distribution one can compute the 
higher-order expectations and thus arrive at a formula which depends only on $n$ and $p$, 

$$E[p - \hat p] \approx \frac{1}{6n^2} \left( 8n^2p^3 - 12n^2p^2 + 6n^2p - n^2 - 24np^3 + 36np^2 - 12np + 16p^3 - 24p^2 + 8p \right). \qquad (28)$$

The expression is dominated by $p$ in the sense that even for very small values of $n$, 
the generalisation error is close (according to our approximation) to the one obtained 
asymptotically, 

$$\lim_{n \to \infty} E[p - \hat p] = \frac{1}{6} (2p - 1)^3. \qquad (29)$$

Figure 2 (left) shows how the value of $n$ affects the generalisation error; and Figure 2 
(right) shows the speed at which the generalisation error converges to the one given by 
the (approximate) asymptotic curve, for several values of $p$. 

The generalisation error results highlight that while the estimates of the true class 
conditional probabilities may be unstable, the classification error rate achieved by SNB 
may be low as the classifier only has to get the prediction on the correct side of the 
decision boundary. 
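
The polynomial in Equation (28) can be checked against a direct summation over the Binomial distribution. The sketch below assumes that the one-step estimate for a binary predictor is $\hat\beta = 4X/n - 2$ with $X \sim \mathrm{Binomial}(n, p)$ and that the logistic function is replaced by its cubic Taylor expansion, so that the approximate error reduces to $E[(2X/n - 1)^3]/6$; the function names are hypothetical. 

```python
from fractions import Fraction
from math import comb


def cubic_error_by_summation(n, p):
    """E[(2X/n - 1)^3] / 6 for X ~ Binomial(n, p), summed term by term."""
    total = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * Fraction(2 * k - n, n) ** 3
        for k in range(n + 1)
    )
    return total / 6


def cubic_error_closed_form(n, p):
    """The polynomial of Equation (28) in n and p."""
    num = (8 * n**2 * p**3 - 12 * n**2 * p**2 + 6 * n**2 * p - n**2
           - 24 * n * p**3 + 36 * n * p**2 - 12 * n * p
           + 16 * p**3 - 24 * p**2 + 8 * p)
    return num / (6 * n**2)


def asymptotic_error(p):
    """Limit of the closed form as n grows, (2p - 1)^3 / 6."""
    return (2 * p - 1) ** 3 / 6
```

Passing `p` as a `Fraction` keeps the comparison exact, which avoids having to choose a floating point tolerance. 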


4.2.4 Theoretical parameter requirements for Fan and Vercauteren (2012) 

For a discussion of practical requirements for the parameter selection in homomorphic 
encryption schemes, see Appendix B. 


The SNB fitting and prediction for both paired and unpaired approximations are 
implemented in the open source R package EncryptedStats (Aslett and Esperança, 2015) 
and can be run on unencrypted data as well as data encrypted using the 
HomomorphicEncryption (Aslett, 2014a) package. These are briefly described in the Appendix. 

In the next section, the two new machine learning techniques tailored for homomorphic 
encryption which have been presented are empirically tested on a range of data sets and 
a real example using a cluster of servers to fit a completely random forest is described. 


5 Results 

In this section we apply the encrypted machine learning methods presented in Sections 
3 and 4 to a number of benchmark learning tasks. 


5.1 Classification performance 


We tested the methods on 20 data sets of varying type and dimension from the UCI 
machine learning data repository (Lichman, 2013), each of which is described in Appendix 
E. For the purposes of achieving many test replicates, the results in this subsection were 
generated from unencrypted runs (the code paths in the EncryptedStats package for 
unencrypted and encrypted values are identical), with checks to ensure that the 
unencrypted and encrypted versions give the same results. Runtime performance for the 
encrypted versions is given below. 

Figure 3 shows the comparison of these novel methods with each other as well as 
with their traditional counterparts. The traditional methods included are full logistic 
regression (LR-full), Gaussian naive Bayes (GNB) and random forests (RF), none of which 
can be computed within the constraints of homomorphic encryption. The methods of this 
paper included are completely random forests (CRF), paired (SNB-paired) and unpaired 
(SNB-unpaired) semi-parametric naive Bayes, and multinomial naive Bayes (MNB). The 
CRFs are all 100 trees grown 3 levels deep, including the stochastic fraction estimate 
(M = 8). 

Figure 3: Performance of various methods. For each model and dataset, the AUC for 100 
stratified randomisations of the training and testing sets; the horizontal lines represent 
the frequency of class y = 1; an asterisk indicates the method can be computed encrypted. 

The performance measure used is the area under the ROC curve (AUC, ranging from 
0 to 1). For each model and dataset the algorithms were run with the same 100 stratified 
randomisations of the training and testing sets (split in the proportion 80%/20%, 
respectively), so that each point on the graph represents the AUC for one train/test split 
and one method. 

The first two data sets (infl and neph) are very easy classification problems and 
the new techniques match the traditional techniques perfectly in this setting, keeping 
almost uniformly good pace in the other rather easy data sets (bcw_d, bcw_o, monks3), 
the unpaired SNB being the exception on bcw_d. 
Unsurprisingly, the traditional random forest tends to perform best in the more 
challenging data sets (Fernandez-Delgado et al., 2014), though only in 4 of the data sets is 
it clearly outperforming all the other methods by the AUC metric. Indeed, in the most 
challenging data sets (blood, bcw_p and haber) the new methods proposed in this work 
exhibit slightly better average performance than their counterparts. 

The results of the unpaired SNB were obtained without centring (which would be 
equivalent to paired) and affirm the observation in the previous section that centring or 
paired computation is always to be preferred where available. 

The SNB method with IRLS run to convergence is not presented in the figure: as 
alluded to in the previous section, the natural shrinkage of the one-step estimator meant 
it performed equally well in most situations and in the chess and musk1 cases the average 
ratio of full convergence over one-step AUCs was 0.76 and 0.67 respectively. 

As an aside, the unencrypted versions of these new methods have good computational 
properties which will scale to massive data sets, because the most complex operations 
involved are addition and multiplication, which modern CPUs can evaluate with a few clock 
cycles of latency. In addition, in all cases even these simple operations can be performed 
in parallel and map directly to CPU vector instructions. 


5.2 Timings and memory use 

All the encrypted methods presented in this work can be implemented to scale reasonably 
linearly in the data set size, so performance numbers are provided per 100 observations 
and per predictor (for logistic regression and semi-parametric naive Bayes) or per tree 
(for completely random forests). 

For reproducibility the timings were measured on a c4.8xlarge compute cluster Amazon 
EC2 instance. This corresponds to 36 cores of an Intel Xeon E5-2666 v3 (Haswell) 
CPU clocked at 2.9GHz. Table 1 shows the relevant timings using the EncryptedStats 
and HomomorphicEncryption packages. 

Table 1: Approximate running times (in seconds, on a c4.8xlarge EC2 
instance) per 100 observations. SNB is per predictor; CRF is per tree, 
where L is the tree depth grown. (An ‘encrypted value’ is a single integer 
value encrypted, so for example using quintiles each $x_{ij}$ is stored as 5 
encrypted values.) 

Model                               SNB-paired   CRF, L = 1   CRF, L = 2   CRF, L = 3 
Fitting                             18.0         12.5         45.2         347 
Prediction                          7.8          15.1         48.3         353 
Approx memory per encrypted value   154KB        128KB        528KB        1,672KB 


Note that the CRF does not scale linearly in the depth of the tree grown, not only 
because of the non-linear growth in computational complexity for trees but also because, 
as L increases, the parameters used for the Fan and Vercauteren (2012) scheme have to 
be increased in such a way that raw performance of encrypted operations drops. This drop 
is due to the increasing coefficient size and polynomial degree of the cipher text space. 
See Appendix B for a discussion of the impact of tree parameters on the encryption 
algorithm. 
5.3 Forest parameter choices 

The completely random forest has a few choices which can tune the performance: number 
of trees T, depth of trees to build L and whether to use stochastic fractions (and to what 
upper estimate, M). An empirical examination of these now follows. 


5.3.1 Forest trees and depth 

We explored the effect of varying the number and depth of the trees used in the algorithm, 
by varying T from 10 to 1000 and L from 1 to 6. We found a clear trend in most cases 
that growing a large forest is much more important than tall trees: indeed, in many cases 
1000 tree forests with 1 level perform equally well to those with 6 levels. A figure in 
the Appendix plots the results. The depth of trees only appears relevant in large forests 
for the iono, magic and two musk data sets. This may indicate that these data sets have 
more non-linearities, a hypothesis supported by the poor performance of the linear methods 
in these cases. Indeed, 1 level deep trees appear to be good candidates for additive models. 


5.3.2 Stochastic fraction 


The CRF algorithm presented in Section 3 has the option, presented in §3.2, of including 
an unbiased encrypted stochastic fraction estimate in order to reduce the amount of 
shrinkage that small leaves undergo. To analyse the impact that this has on the 
performance, the AUC was recomputed after setting different values for M from 0 (i.e. 
original algorithm, no stochastic fraction) through to 64 in all 100 train/test splits of 
the 20 datasets that were presented already. 

Figure 4 shows the results from the data set with the largest improvement in AUC 
performance from among the 20 data sets when using the stochastic fraction estimate. 
Improvements in AUC of up to 12.6% were achieved using the stochastic fraction versus 
omitting it. All points which are above the y = x line indicate improvements in AUC 
when using the stochastic fraction for the particular train/test split. 

None of the 20 tested datasets decreased in average AUC with increasing M, but 
Figure 5 shows the data set for which the AUC was least improved. In this instance it is 
striking that all the points cluster around the y = x line, showing that in the 
worst case the stochastic fraction estimate has essentially negligible impact. 

These figures empirically illustrate the fact that the stochastic fraction has the po¬ 
tential to dramatically improve the performance of completely random forests, whilst not 
really having a negative impact in those situations where it does not help. As such, it 
would seem to make sense to include by default. 


5.4 Case study in encrypted cloud computing machine learning 

To demonstrate the potential to utilise cloud computing resources for sensitive data anal¬ 
ysis we undertook a benchmark case study performed with fully encrypted data analysis 
on the original Wisconsin breast cancer data set using a compute cluster of 1152 CPU 
cores on Amazon Web Services, at a total cost of less than US$ 24 at the time of writing. 
The resources used here are readily available to any scientist. 
Figure 4: The AUC change for the iono data set with different values for M in the stochastic 
fraction estimate (panels M = 2, M = 4, M = 8). The x-axis is always with no stochastic 
fraction estimate (M = 0); the y-axis is for the shown value of M; one point per 
train/test split. 


5.4.1 The problem setup 


As mentioned earlier in §3.3.2 the completely random forest is amenable to 
embarrassingly parallel computation, whereby the data can be split into shards using the 
same random seed, the expensive fitting step performed on each shard, and the final fit 
produced by inexpensively summing the individual tree shards. 

The training part of the Wisconsin data (n = 547) was initially split into shards of at 
most 32 observations each, resulting in 17 full 32-observation shards and an 18th shard of 3 
observations. These shards were then each encrypted using the HomomorphicEncryption 
R package (Aslett, 2014a) under the Fan and Vercauteren scheme using parameters: 

$\Phi = x^{8192} + 1$ 
$q = 2^{224}$ 
$\;\; = 26959946667150639794667015087019630673637144422540572481103610249216$ 
$t = 200000$ 
$\sigma = 16$ 

This renders a theoretical cipher text size of $2 \times 8192 \times 224 / (8 \times 1024) = 448$KB 
per encrypted value, ignoring keys and overhead. These parameters offer around 158-bits of 
security (using bounds in Lindner and Peikert (2011)); informally this means on the 
order of $2^{158}$ fundamental operations must be performed on average in order to break 
the encryption. On disk, each gzipped shard of 32 cipher texts for the predictors occupied 
about 737MB and for the responses occupied about 33.7MB, for a total disk space of 
around 13.8GB. 

Figure 5: The AUC change for the bcw_p data set with different values for M in the 
stochastic fraction estimate. The x-axis is always with no stochastic fraction estimate 
(M = 0); the y-axis is for the shown value of M; one point per train/test split. 
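
The arithmetic behind these figures can be reproduced with a few lines. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and it assumes a cipher text consists of two polynomials of 8192 coefficients with 224 bits each, as in the size calculation quoted above. 

```python
def ciphertext_kb(degree, log2_q, n_polys=2):
    """Theoretical cipher text size in KB: n_polys * degree coefficients of log2_q bits."""
    return n_polys * degree * log2_q / (8 * 1024)


def shard_sizes(n_obs, shard_size):
    """Split n_obs observations into full shards plus one remainder shard."""
    full, rest = divmod(n_obs, shard_size)
    return [shard_size] * full + ([rest] if rest else [])


shards = shard_sizes(547, 32)
# Per-shard disk use quoted in the text: about 737MB predictors + 33.7MB responses.
total_gb = len(shards) * (737 + 33.7) / 1000
```

The 18 shards at roughly 770.7MB each account for the total of just under 13.9GB reported above. 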

This data was uploaded to an Amazon S3 bucket, with the transfer time using the 
University internet connection being approximately 16 minutes. If this data was to be 
stored long-term on Amazon S3, it would cost US$ 0.42 per month at the time of writing. 

Once the data was in place, an Amazon SQS queue was set up in which to store a 
reference to each shard. This queue acts as a simple job dispatch system, designed to ensure 
that each server in the cluster can remain completely independent for maximum speed, 
by eliminating inter-server communication, and as a mechanism to ensure no duplication 
of work. 

With these elements in place, the RStudio AMI ami-628c8a0a (Aslett, 2014b) was 
extended to add a startup script which (in summary) fetches the work to perform from the 
SQS queue, downloads it from S3, executes the forest building using the EncryptedStats 
R package, and uploads the result to the S3 bucket. 


5.4.2 The fitting run 

The fitting run used Amazon’s spot instances: these are a ‘stock market’ for unused 
capacity, where it is often possible to bid below the list price for compute servers. The 
completely random forest is well suited to exploiting low spot prices on EC2 wherever 
they may arise because it can be formulated in an embarrassingly parallel manner and 
launched in very geographically dispersed regions without regard for connectivity speeds, 
since communication costs between nodes are effectively zero. 

When the run was performed on 5th May 2015, the spot price for c3.8xlarge instances 
was lowest in Dublin, Ireland and São Paulo, Brazil. Consequently, the data was 
replicated to two S3 buckets local to these regions and the customised AMI copied. Then a 
cluster of 18 c3.8xlarge servers was launched in each region, giving a total of 1,152 CPU 
cores and 2,160GB of RAM. 

Each server was set up to compute 50 trees on its shard of data and every shard was 
handed out twice so that a total of 100 trees were fitted. Tree growth for the two different 
sets of 50 trees was initialised from a common random seed, eliminating the need for 
servers to communicate the trees grown. 

After 1 hour and 36 minutes the cluster had completed the full run and finished 
uploading the encrypted tree fit (that is, encrypted versions of $\rho^t_{bc}\ \forall\, t, b, c$) 
back to the S3 bucket. The total space required for storing the 36 forests of 50 trees 
fitted on each shard was 15.6GB. At this juncture, the forests could be combined 
homomorphically to produce a single forest of 100 trees which would then require 868MB to 
store. Note that with the trees fitted, it would then be possible to archive the 13.8GB of 
original data so that only 868MB needs to be stored long term or downloaded. 

The cost of the 36 machines for 2 hours was US$23.86 (about £15.66). Note that 
the spot prices were not exceptionally low on the day in question and no effort was made 
to select an opportune moment. By the same token the price is inherently variable and 
it may be necessary to wait a short time for favourable spot prices to arise if there are 
none at the time of analysis. 

5.4.3 Results 

The encrypted version of the forest was downloaded, decrypted and compared to the 
results achieved when performing the same fit using an unencrypted version of the 
data, starting tree growth from the same seeds and using identical R code from the 
EncryptedStats package (separate code is not required due to the HomomorphicEncryption 
package fully supporting operator overloading). The resultant fit from both encrypted 
and unencrypted computation was in exact agreement. 


6 Discussion 

Fully homomorphic encryption schemes open up the prospect of privacy preserving ma¬ 
chine learning applications. However practical constraints of existing FHE schemes de¬ 
mand tailored approaches. With this aim we made bespoke adjustments to two popular 
machine learning methods, namely extremely random forests and naive Bayes classifiers, 
and demonstrated their performance on a variety of classifier learning tasks. We found 
the new methods to be competitive against their unencrypted counterparts. 

To the best of our knowledge these represent the first machine learning schemes tai¬ 
lored explicitly for homomorphic encryption so that all stages (fitting and prediction) can 
be performed encrypted without any multi-party computation or communication. 

Furthermore, the unencrypted version of these new methods will scale to massive 
data sets, because the most complex operations involved are addition and multiplication. 
Indeed, even these simple operations can be performed in parallel for most of the presented 
algorithms and map directly to CPU vector instructions. This is an interesting avenue of 
future research. 
A Completely Random Forest (CRF) algorithm 

In detail, consider a training set with $N$ observations consisting of a categorical 
response, $y_i \in C$, and $P$ predictors, $x_{ip}$ (categorical / ordinal / continuous), 
for $i \in \{1, \dots, N\}$, $p \in \{1, \dots, P\}$. All variables (predictors and response) 
are first transformed using method 1 from §2.2.2 prior to encryption. Thus, 
$y_i \to y_{ic} \in \{0, 1\}$ for $c \in \{1, \dots, |C|\}$ and $x_{ip} \to x_{ipk}$. 
Consider these herein to be in encrypted form. 

The proposed algorithm for Completely Random Forests (CRFs) is then as follows: 

Algorithm 

1. Specify the number of trees to grow, $T$, and the depth to make each tree, $L$. 

2. For each $t \in \{1, \dots, T\}$, build a tree in the forest: 

(a) Tree growth: For each $l \in \{1, \dots, L\}$, build a level: 

i. Level $l$ will have $2^{l-1}$ branches (splits), each of which will have a partition 
applied as follows. For each $b \in \{1, \dots, 2^{l-1}\}$, construct the partitions: 

A. Splitting variable: Select a variable $p_{tlb}$ at random from among 
the $P$ predictors. Due to the encoding of §2.2.2, this variable has a 
partition $\mathcal{K}_{p_{tlb}}$ associated with it. 

B. Split point: Create a partition of $\mathcal{K}_{p_{tlb}}$ at random in order to 
perform a split on variable $p_{tlb}$: $\mathcal{D}^{tlb} = \{D_1^{tlb}, D_2^{tlb}\}$, 
where each $D_h^{tlb}$ is a union of some elements of $\mathcal{K}_{p_{tlb}}$, with 
$D_1^{tlb} \cap D_2^{tlb} = \emptyset$ and $D_1^{tlb} \cup D_2^{tlb}$ covering 
all of $\mathcal{K}_{p_{tlb}}$. 

Note that for categorical predictors this is a random assignment of 
levels from the partition $\mathcal{K}_{p_{tlb}}$ to each $D_h^{tlb}$, while for 
ordinal predictors a split point is chosen and the partition formed by the 
levels either side of the split. 

Note also the indexing of $\mathcal{D}^{tlb}$, to emphasise that if variable 
$p_{tlb}$ is selected more than once (in different levels or trees) a different 
random split is chosen. 

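(b) Tree fitting: The total number of training observations belonging to category 
$c$ in the completely randomly grown tree $t$ at terminal leaf $b \in \{1, \dots, 2^L\}$ is 
then: 

$$\rho^t_{bc} = \sum_{i=1}^{N} y_{ic} \prod_{l=1}^{L} \left( \sum_{k \in D^{t\,l\,g(b,l)}_{h(b,l)}} x_{i\,p_{t\,l\,g(b,l)}\,k} \right)$$

where 

$$g(b, l) := \left\lceil \frac{b}{2^{L+1-l}} \right\rceil, \qquad h(b, l) := \left\lfloor \frac{(b - 1) \bmod 2^{L+1-l}}{2^{L-l}} \right\rfloor + 1.$$

Figure 6 is useful for understanding this. 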
Note in particular that $\rho^t_{bc}$ written this way is simply a polynomial and can 
be computed homomorphically with multiplicative depth $L$. Both $g(\cdot, \cdot)$ and 
$h(\cdot, \cdot)$ involve only indices of the algorithm which will not be encrypted. Thus, 
the training data can be evaluated on the tree without the use of comparisons. 
The total number of training observations in this terminal leaf is then simply: 

$$\rho^t_b = \sum_{c=1}^{|C|} \rho^t_{bc}$$

Thus a single tree, $t$, fitted to a set of training data, is defined by the tuple of 
sets: 

$$\left( \{p_{tlb}\},\; \{\mathcal{D}^{tlb}\},\; \{\rho^t_{bc} : b \in \{1, \dots, 2^L\},\, c \in \{1, \dots, |C|\}\} \right)$$


3. Prediction: Once the forest has been grown, attention turns to prediction. Given an encrypted test observation with predictors $x^\star_{pk}$, the objective is to predict the response category. Define $y^\star_c$ to be the number of votes for response category $c$, which can be simply computed as:

$$y^\star_c = \sum_{t=1}^{T} \sum_{b=1}^{2^L} \rho^t_{bc} \prod_{l=1}^{L} \left( \sum_{k \in D^{t l g(b,l)}_{h(b,l)}} x^\star_{p_{t l g(b,l)} k} \right)$$

where $g(\cdot,\cdot)$ and $h(\cdot,\cdot)$ are as defined above. Each $y^\star_c$ is returned from the cloud to the client. The client decrypts and forms a predictive empirical 'probability' as:

$$\hat{p}_c = \frac{y^\star_c}{\sum_{c'=1}^{|C|} y^\star_{c'}} \qquad (30)$$


Hence, the proposed CRF algorithm makes use of a quantisation procedure on the
data, followed by completely random selection of the variable and a completely random
partition of the quantile bins, in order to eliminate any need to perform comparisons.

Tree growth (2a) occurs unencrypted and is a very fast operation. 

Tree fitting (2b) involves counting the number of training observations which lie in
each terminal node of each tree. In computing $\rho^t_{bc}$, the inner sum evaluates whether an
observation has a predictor value in the relevant partition at level $l$ of tree $t$. The
product over all levels then results in 1 if and only if the observation lies in this path
through the tree. Finally, multiplication by the class indicator $y_{ic}$ ensures the observation
only counts if it belongs to response category $c$.
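The counting argument can be checked on unencrypted data with a short Python sketch. It assumes the index functions take the form $g(b,l) = \lceil b / 2^{L+1-l} \rceil$ and $h(b,l) = \lfloor ((b-1) \bmod 2^{L+1-l}) / 2^{L-l} \rfloor + 1$; the tree layout, the 0-based bin positions and the helper names are hypothetical and chosen only for illustration.

```python
from math import ceil

def g(b, l, L):
    """Branch index at level l on the path to terminal leaf b."""
    return ceil(b / 2 ** (L + 1 - l))

def h(b, l, L):
    """Side (1 or 2) of that branch's partition on the path to leaf b."""
    return ((b - 1) % 2 ** (L + 1 - l)) // 2 ** (L - l) + 1

def leaf_counts(tree, L, X, Y):
    """rho[b][c]: class-c observations in leaf b, using only sums and products.

    X[i][var] is a 0/1 indicator list over the bins of var (0-based positions),
    Y[i] is a 0/1 indicator list over the classes.
    """
    n_classes = len(Y[0])
    rho = {}
    for b in range(1, 2 ** L + 1):
        rho[b] = [0] * n_classes
        for x, y in zip(X, Y):
            on_path = 1
            for l in range(1, L + 1):
                var, parts = tree[l][g(b, l, L)]
                side = parts[h(b, l, L) - 1]
                on_path *= sum(x[var][k] for k in side)  # inner sum is 0 or 1
            for c in range(n_classes):
                rho[b][c] += y[c] * on_path  # counts only if class is c
    return rho

def one_hot(k, n):
    return [1 if i == k else 0 for i in range(n)]

# Hand-built depth-2 tree: tree[l][b] = (variable, (D1, D2))
tree = {
    1: {1: ("v1", ({0, 1}, {2}))},
    2: {1: ("v2", ({0}, {1})), 2: ("v2", ({1}, {0}))},
}
raw = [(0, 0, 0), (1, 1, 1), (2, 1, 1), (2, 0, 0), (0, 0, 1)]  # (v1, v2, class)
X = [{"v1": one_hot(a, 3), "v2": one_hot(v, 2)} for a, v, _ in raw]
Y = [one_hot(c, 2) for _, _, c in raw]
rho = leaf_counts(tree, 2, X, Y)
```

No comparison on the data is ever made: every operation on `x` and `y` is an addition or a multiplication, which is what allows the same circuit to be evaluated on cipher texts.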

If growing a tree encrypted is the objective then at this point the resulting $\rho^t_{bc}$ can be
decrypted. Alternatively, without decrypting it is possible to immediately predict with
the tree, using a similar procedure of evaluating an effectively binary circuit representation
of the tree and data.

Figure 6: Completely Random Forest algorithm depiction. The dashed line shows the
evaluation of a training/testing observation via the $g$ and $h$ functions; $b$ signifies the
branch number at the given level $l$.

B Theoretical parameter requirements for Fan and Vercauteren (2012)

B.1 Completely Random Forests parameter requirements

In order to be able to fit extremely random forests of $T$ trees each of depth $L$ encrypted
under the scheme of Fan and Vercauteren (2012), the parameter choice must support at
least $L$-depth multiplications. Additionally, the message space must support coefficient
values up to size $T \max\{N_c : c = 1,\dots,|C|\}$, where $N_c$ is the number of training
observations in class $c$, since all values will accumulate on the
first coefficient of $m(x)$. This covers the most extreme (and improbable) possibility that
a leaf in every tree contains all the observations for a particular class.

In the event that the stochastic fraction variant is used, the parameters must support a 
total of at least L+M multiplications. If prediction is to be performed without decrypting 
the tree or refreshing the noise in the tree cipher text, then the scheme must support an 
additional L multiplications on top of the values above. 
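As a rough planning aid, the requirements stated in this subsection can be collected into a small Python helper. The function name and arguments are hypothetical; the arithmetic simply restates the bounds given in the text (depth $L$, or $L+M$ with the stochastic fraction variant, plus a further $L$ when predicting without decrypting or refreshing, and a largest coefficient of $T$ times the largest class size).

```python
def crf_requirements(T, L, class_counts, M=None, predict_without_refresh=False):
    """Return (multiplicative depth, largest message coefficient) needed.

    T: number of trees, L: tree depth,
    class_counts: number of training observations in each class,
    M: stochastic fraction size, or None if that variant is not used.
    """
    depth = L if M is None else L + M
    if predict_without_refresh:
        depth += L  # prediction re-evaluates the tree on top of the fit
    max_coeff = T * max(class_counts)  # a leaf in every tree holds a whole class
    return depth, max_coeff

basic = crf_requirements(T=200, L=2, class_counts=[40, 60])
with_fraction = crf_requirements(T=200, L=2, class_counts=[40, 60], M=8)
full = crf_requirements(T=200, L=2, class_counts=[40, 60], M=8,
                        predict_without_refresh=True)
```

The resulting pair could then inform the choice of the depth and maximum-value inputs when selecting scheme parameters.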


The function parsHelp in the HomomorphicEncryption package (Aslett, 2014a) can
be used to guide selection of the scheme parameters $d$, $t$, $q$ and $\sigma$ which will enable the
multiplicative depth and maximum coefficient values required.

B.2 Semiparametric Naive Bayes parameter requirements 


In order to be able to fit SNB encrypted under the scheme of Fan and Vercauteren (2012)
(Aslett, Esperança and Holmes, 2015), the coefficients $\{e_j, d_j\}$ must be returned
for offline assembly via Equation (11) (because homomorphic divisions are not possible),
together with the empirical class frequencies. The former have multiplicative depth equal
to 4 and 3, respectively, and the latter equal to 0. With respect to the growth in message
coefficient size for the FandV scheme, the biggest growth comes from computing $d_j$.
As described in detail in Aslett, Esperança and Holmes (2015), initially
the polynomial representation of all $x_{ij}$ will have maximum coefficient value of 1.
Therefore, by the multinomial theorem the maximum coefficient value of the polynomial
representation of $x_{ij}^2$ will be 2. Thus the maximum coefficient for the first term is $2N$.
In contrast, the maximum coefficient of $\sum_i x_{ij}$ will be $N$, so that by the multinomial
theorem the maximum coefficient value of the polynomial representation of $(\sum_i x_{ij})^2$ will
be $2N^2$. Therefore, to ensure correct decryption after fitting the naive Bayes model, the
message space must support coefficient values up to size $2N^2$.

The exclusion of the intercept term has an impact on these requirements and, consequently,
on the running time of the algorithm. Precisely, numerator and denominator
of fd have multiplicative depth equal to 2 and 1, respectively, in this case. The message 
coefficient requirements will also be significantly relaxed, with the maximum coefficient
value being $2N$.


C Multinomial Naive Bayes algorithm 

The Multinomial Naive Bayes (MNB) classifier estimates $P(x_j \mid y)$ and $P(y)$ using the
corresponding empirical probabilities,

$$\hat{p}_{x_j=k|y=c} = \frac{\#\{X_{\cdot j} = k \wedge Y = c\}}{\#\{Y = c\}} \quad \text{and} \quad \hat{p}_{y=c} = \frac{\#\{Y = c\}}{N} \qquad (31)$$

where $\#\{z\}$ is the cardinality of the set satisfying condition $z$; $k$ denotes the $k$th possible
value taken by $x_{ij} \in \mathbb{N}_{1:M_j}$; and $X_{\cdot j}$ and $Y$ are $N$-dimensional vectors of
observations for predictor $j$ and response, respectively. The "Multinomial" qualifier should
be understood to mean precisely that $x_{ij}$ takes values in the discrete set $\mathbb{N}_{1:M_j}$ and so
$\mathcal{X} = \mathbb{N}_{1:M_1} \times \cdots \times \mathbb{N}_{1:M_P}$.

Because it is not possible to perform homomorphic divisions, the algorithms will be
written in terms of unnormalised probabilities, or counts:

$$\hat{p}^{(u)}_{x_j=k|y=c} = \#\{X_{\cdot j} = k \wedge Y = c\} \quad \text{and} \quad \hat{p}^{(u)}_{y=c} = \#\{Y = c\}. \qquad (32)$$

Estimation, therefore, requires the computation of $\hat{p}^{(u)}_{x_j=k|y=c}$ and $\hat{p}^{(u)}_{y=c}$ for all possible values
of $y$ and $x_j$, and for all $j \in \mathbb{N}_{1:P}$. This corresponds to a $2 \times M_j$ table of conditional
probabilities for each predictor $j \in \mathbb{N}_{1:P}$, plus two class probabilities. In total, this makes
$2 + 2 \sum_{j=1}^{P} M_j$ parameters.

A computational problem arises from the fact that $\hat{p}^{(u)}_{x_j=k|y=c}$ cannot be computed directly:
since the values $x_{ij}$ are encrypted, it is impossible to count, say, the number of
2s in the vector $X_{\cdot j}$ directly. An indirect route follows from the first quantisation method
in Section 2.2.2. Let $\tilde{X}$ denote the binary-expanded version of $X$, and $\tilde{x}_{ijk}$ denote the
$k$th bin of $x_{ij}$, so that $\tilde{x}_{ij} = (\tilde{x}_{ij1},\dots,\tilde{x}_{ijM_j})$. Then, the conditional counts can be computed as

$$\hat{p}^{(u)}_{x_j=k|y=1} = \sum_{i=1}^{N} \tilde{x}_{ijk}\, y_i \quad \text{and} \quad \hat{p}^{(u)}_{x_j=k|y=0} = \sum_{i=1}^{N} \tilde{x}_{ijk}\, (1 - y_i) \qquad (33)$$

and the class counts as

$$\hat{p}^{(u)}_{y=1} = \sum_{i=1}^{N} y_i \quad \text{and} \quad \hat{p}^{(u)}_{y=0} = N - \sum_{i=1}^{N} y_i \qquad (34)$$

The multiplicative depth of $\hat{p}^{(u)}_{y}$ is 0, while that of $\hat{p}^{(u)}_{x_j|y}$ is 1. As standard practice,
implementations should use Laplace-smoothed counts or probabilities to avoid the problems
arising from having zero counts for some values of $k$.
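A small unencrypted Python sketch may help to see why the binary expansion makes counting possible with sums and products alone, as in Equations (33) and (34). The function names and the toy data are hypothetical and serve only as an illustration.

```python
def one_hot(value, n_levels):
    """Binary expansion of a value in 1..n_levels."""
    return [1 if value == k else 0 for k in range(1, n_levels + 1)]

def unnormalised_counts(X, y, levels):
    """Conditional and class counts; levels[j] is M_j for predictor j."""
    N = len(y)
    Xt = [[one_hot(X[i][j], levels[j]) for j in range(len(levels))]
          for i in range(N)]
    cond = {1: [], 0: []}
    for j, M in enumerate(levels):
        # Equation (33): one multiplication per term, then additions
        cond[1].append([sum(Xt[i][j][k] * y[i] for i in range(N))
                        for k in range(M)])
        cond[0].append([sum(Xt[i][j][k] * (1 - y[i]) for i in range(N))
                        for k in range(M)])
    # Equation (34): additions only
    class_counts = {1: sum(y), 0: N - sum(y)}
    return cond, class_counts

X = [[1, 2], [2, 2], [1, 1], [3, 2]]  # two predictors with M = (3, 2) levels
y = [1, 0, 1, 1]
cond, class_counts = unnormalised_counts(X, y, levels=[3, 2])
```

Here `cond[c][j][k-1]` plays the role of $\hat{p}^{(u)}_{x_j=k|y=c}$; the zero entries in this toy output are exactly the situation that Laplace smoothing is meant to address.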

Classification of an observation $X^\star = (x^\star_1,\dots,x^\star_P)$ requires the computation of the
factors

$$F_c = P(y = c) \prod_{j=1}^{P} P(x_j = x^\star_j \mid y = c) = N^{-1} \left(\hat{p}^{(u)}_{y=c}\right)^{1-P} \prod_{j=1}^{P} \hat{p}^{(u)}_{x^\star_j|y=c} \qquad (35)$$

for $c \in \{0,1\}$. If $X^\star$ is known only in encrypted format, then $\hat{p}^{(u)}_{x^\star_j|y=c}$ must be computed
indirectly as

$$\hat{p}^{(u)}_{x^\star_j|y=c} = \sum_{k=1}^{M_j} \tilde{x}^\star_{jk}\, \hat{p}^{(u)}_{x_j=k|y=c} \qquad (36)$$

where $\tilde{X}^\star = (\tilde{x}^\star_{jk})$ is the binary-expanded version of $X^\star$, and which has multiplicative
depth equal to 2. The predicted probability of $X^\star$ belonging to class $y = 1$ is then
$P(y = 1 \mid X^\star) = F_1/(F_0 + F_1)$, a final step which must be computed offline.

It should be noted that homomorphic MNB requires the use of binary-expanded data
($\tilde{X}$) as opposed to integer data ($X$). This has a direct impact on the computational cost
(as it leads to a larger number of homomorphic operations; recall Equations (33) and
(36)) and memory requirements (as the data must be represented in the binary-expanded
form).

In the next Section (§4.1) an alternative approach is suggested which mitigates some
of these problems, while still relying on the NB assumption.
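The classification factors of Equations (35) and (36) can be checked with exact rational arithmetic in plain Python. The counts below are made-up illustrative values, and `class_factor` is a hypothetical helper name; the division is performed in the clear, mirroring the offline final step.

```python
from fractions import Fraction

def class_factor(x_star_onehot, cond_c, n_c, N):
    """F_c from unnormalised counts.

    x_star_onehot[j]: binary expansion of the j-th test predictor,
    cond_c[j][k]:     count of level k+1 of predictor j within class c,
    n_c:              class count, N: number of training observations.
    """
    P = len(cond_c)
    prod = 1
    for j in range(P):
        # Equation (36): a dot product selects the count of the observed level
        prod *= sum(xk * pk for xk, pk in zip(x_star_onehot[j], cond_c[j]))
    # Equation (35): F_c = N^{-1} * n_c^{1-P} * product (division done offline)
    return Fraction(prod, N * n_c ** (P - 1))

# Illustrative counts for P = 2 predictors with 3 and 2 levels, N = 7
cond1, n1 = [[2, 1, 1], [1, 3]], 4
cond0, n0 = [[1, 2, 0], [2, 1]], 3
x_star = [[1, 0, 0], [0, 1]]  # test observation (1, 2), binary expanded

F1 = class_factor(x_star, cond1, n1, N=7)
F0 = class_factor(x_star, cond0, n0, N=7)
prob1 = F1 / (F0 + F1)
```

For class 1 the direct calculation gives $\tfrac{4}{7} \cdot \tfrac{2}{4} \cdot \tfrac{3}{4} = \tfrac{3}{14}$, which agrees with the count-based form.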


Theoretical parameter requirements for Fan and Vercauteren (2012)

For MNB, the elements to be returned are $\hat{p}^{(u)}_{y=c}$ and $\hat{p}^{(u)}_{x^\star_j|y=c}$ for $j \in \mathbb{N}_{1:P}$ and $c \in \{0,1\}$,
with the former having multiplicative depth equal to 0 and the latter equal to 2.

It is possible to reduce the number of elements to be returned to 4 by computing
$(\hat{p}^{(u)}_{y=c})^{P-1}$ and $\prod_{j=1}^{P} \hat{p}^{(u)}_{x^\star_j|y=c}$ homomorphically, but this comes at the cost of an increase
in the multiplicative depth to $P - 1$ and $3P - 1$, respectively.

D Open source software contributions


We have contributed two open source R packages which make it easy to work with the
techniques described and contributed herein.

The first, HomomorphicEncryption (Aslett, 2014a), implements the Fan and Vercauteren
(2012) homomorphic encryption scheme in a generic framework which can be
extended to plug in other encryption schemes over time. The package is written in high
performance C++ and utilises full operator overloading in R. Indeed, all common types
such as vectors and matrices are implemented so that they can be transparently encrypted
and manipulated just like their unencrypted counterparts. The theoretical security and
multiplicative depth bound results of different papers are wrapped in a helper function
to aid in parameter choice.

The second, EncryptedStats (Aslett and Esperança, 2015), implements the tailored
machine learning methods of this paper in such a way as to enable fitting and prediction
on unencrypted and encrypted data using precisely the same code paths (thanks to the
operator overloading support of HomomorphicEncryption).


D.1 Introduction to the HomomorphicEncryption package

There are three steps to encrypting data using the package: i) parameter selection; ii)
key generation; and iii) encryption/decryption. Once these are understood, homomorphic
operations follow naturally from operator overloading.

Parameter selection

The scheme of Fan and Vercauteren (2012) has four parameters which must be selected.
In the package these are referred to as:

• d, the power of the cyclotomic polynomial ring (default 4096);

• sigma, the standard deviation of the discrete Gaussian used to induce a distribution
on the cyclotomic polynomial ring (default 16.0);

• qpow, the power of 2 to use for the coefficient modulus (default 128);

• t, the value to use for the message space modulus (default 32768).

These are arguments of the pars function, the first argument of which selects the encryption
scheme to use. Presently only Fan and Vercauteren (2012) is implemented ("FandV").
If the defaults are acceptable except for the message space modulus, which should be
doubled, it could be set by running:

> p <- pars("FandV", t=65536)
> p
Fan and Vercauteren parameters
phi = x^4096+1
q = 340282366920938463463374607431768211456 (128-bit integer)
t = 65536
delta = 5192296858534827628530496329220096
sigma = 16
Security level approx 128-bits
Supports multiplicative depth of 3 with overwhelming probability
(i.e. lower bound, likely more possible)

The selection of these parameters can be bewildering, so a feature of the package is a
parameter helper function parsHelp which has a simpler set of options and will use
theoretical bounds from the cryptography literature to automatically select the parameters
for you. These simpler choices are:

• lambda, the security level required (bits), default is 80; 

• max, the largest absolute value you will need to store encrypted, default is 1000; 

• L, the deepest multiplication depth you need to be able to evaluate encrypted, 
default is 4. 

So, for example, we could have selected the ability to store a larger message, while requiring
at least 140 bits of security:

> p <- parsHelp("FandV", lambda=140, max=65536)
> p
Fan and Vercauteren parameters
phi = x^8192+1
q = 365375409332725729550921208179070754913983135744 (158-bit integer)
t = 131072
delta = 2787593149816327892691964784081045188247552
sigma = 16
Security level approx 273-bits
Supports multiplicative depth of 4 with overwhelming probability
(i.e. lower bound, likely more possible)

Note that the parameter helper will often be conservative in setting the security level:
the theoretical bounds may not allow all constraints to be simultaneously satisfied exactly
and so the security is treated as a minimum required level.

Key generation 

With the parameter values stored in the output from either the pars or parsHelp
functions, generating the keys is then straightforward:

> keys <- keygen(p)

The public key is stored in keys$pk, the secret key in keys$sk and the relinearisation
key (used during multiplication operations) in keys$rlk.

In order to save the keys for future use (or to give the public key to another party), 
see the saveFHE and loadFHE functions in the package. 

Encryption/Decryption 

Encryption and decryption follow the same pattern, with encryption
requiring only the public key and decryption only the secret key:

> ct <- enc(keys$pk, 2)
> ct
Fan and Vercauteren cipher text
( c_0 = 13x^4096+137352383756050088155497465542996641770x^4095+7041424...,
c_1 = -46602244866771058520503760576262076531x^4095-4071136373586351... )
> m <- dec(keys$sk, ct)
> m
[1] 2


Homomorphic operations 

Using homomorphic operations is the simplest aspect of all, because the package
overloads the + and × (*) operators and also implements native support for vectors and
matrices being encrypted to cipher texts.

The following example mixes scalars, vectors and matrices and showcases both scalar
and matrix operations:

> x <- 2
> y <- c(1,2,3)
> z <- matrix(1:9, nrow=3)
> z
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> xct <- enc(keys$pk, x)
> yct <- enc(keys$pk, y)
> zct <- enc(keys$pk, z)
> res <- x + y %*% t(z) %*% z
> resct <- xct + yct %*% t(zct) %*% zct
> res
     [,1] [,2] [,3]
[1,]  230  554  878
> dec(keys$sk, resct)
     [,1] [,2] [,3]
[1,]  230  554  878
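The plain-text arithmetic in this example can be verified independently of the package with a few lines of Python. This is only a check of the unencrypted calculation $x + y\,z^{\top} z$; the helper functions are hypothetical and nothing here performs encryption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

x = 2
y = [[1, 2, 3]]                          # treated as a 1 x 3 row vector
z = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]    # column-wise fill of 1..9 into 3 rows

product = matmul(matmul(y, transpose(z)), z)
res = [[x + v for v in row] for row in product]  # scalar added to each entry
```

The intermediate row vector is (30, 36, 42), and adding the scalar to the final product reproduces the values 230, 554 and 878 shown in the transcript.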

D.2 Introduction to the EncryptedStats package 

The first two methods implemented in the package are the two methods presented in this 
paper. 

Completely Random Forests 

There are three steps in working with CRFs in the package: i) grow a forest; ii) fit data
to each tree (accumulate votes); and iii) predict a new observation.

When working with CRFs, the data must already be in the representation of Method
1 from §2.2.2 in either an encrypted or unencrypted matrix in R.

Imagine a toy data set consisting of N observations of two variables (v1 and v2) where
the first is ordinal and the second categorical. Assume that these are in quintiles so that
the data matrix is N × 10, stored in the variable called X, say, in R. Also assume the
responses of the training observations are similarly stored in y.

i) Forest growth 

Tree growth is blind to the data, so you only need to identify the variable names of
each column of the (un)encrypted data matrix and their type: either ordinal ("ord")
or categorical ("cat"). For the toy example,

> Xvars <- rep(c("v1", "v2"), each=5)
> Xvars
 [1] "v1" "v1" "v1" "v1" "v1" "v2" "v2" "v2" "v2" "v2"

This vector of names is used to tell the tree growth algorithm that the first 5 columns
of X contain a quantisation of variable v1 and the second 5 columns contain a quantisation
of variable v2.

> Xtype <- c("v1"="ord", "v2"="cat")
> Xtype
   v1    v2
"ord" "cat"

This vector then specifies the type for each variable. Together these can be passed to
the forest growth algorithm, along with the specification of the number of trees to grow
(T) and how deep to grow the trees (L):

> forest <- CRF.grow(Xvars, Xtype, T=200, L=2)


ii) Fit the data 

The fitting step is now straightforward. The forest is given to the fitting function,
together with the data matrix and the responses, either encrypted or unencrypted. The
only additional option is what size of stochastic fraction estimate to use (the M parameter
from §3.2), passed as argument resamp.


To do this unencrypted, one simply runs: 


> fit <- CRF.fit (forest, X, y, resamp=8) 


If X and y are encrypted, then the algorithm requires the value 1 encrypted, as well
as M both encrypted and in plain text, since the stochastic fraction estimate combines
M, the constant 1 and the $\eta_i$, where the $\eta_i$ come from X. Therefore, the public key
must be provided so that the algorithm can produce encrypted versions of these.


> fit <- CRF.fit (forest, X, y, resamp=8, pk=keys$pk) 


For the Fan and Vercauteren (2012) scheme in particular, it is possible to add
unencrypted values to encrypted ones knowing only the parameters of the scheme, so this
requirement may be removed in a future release of the package.


iii) Prediction 

Imagine now that there are N' test observations for the toy example, similarly stored
in an unencrypted or encrypted N' × P matrix, newX. Then prediction proceeds using
the forest and fit from the previous commands:

> pred <- CRF.pred(forest, fit, newX)

For similar reasons to above, the public key is required when computing with
encrypted data:

> pred <- CRF.pred(forest, fit, newX, pk=keys$pk)

Upon completion, pred will be an N' × 2 matrix with votes for class j on observation
i being in the (i,j)th entry.


Semi-parametric Naive Bayes

There are two steps in working with the SNB classifier in the package: i) fit a model to
the data, i.e., estimate the factors that make up the parameters {6,a/(3}] ii) predict a 
new observation. 

The user has the choice of working with encrypted or unencrypted data (this is
automatically detected), with the proviso that both steps must use the same type of data,
i.e., either both use encrypted data or both use unencrypted data.

Assume we have an N × P design matrix X and a length N binary response vector
y; and let cX and cy denote the corresponding ciphertexts, encrypted with the package
HomomorphicEncryption (see §D.1).

i) Estimation

Besides the data there are only two parameters in the estimation step: paired controls
whether paired or unpaired optimisation is to be used (see §4.2); and pk is the public
key used to encrypt cX and cy, which is used to encrypt some constants required in the
estimation process and is only needed if encrypted data is used.

Estimating the class counts {counts(y = 0), counts(y = 1)} and the coefficients
$\{a_j, b_j, d_j\}$ is then straightforward using the fitting function SNB.fit. Say N = 10
and P = 2; then, with unencrypted data we obtain

> snb1.fit <- SNB.fit(X, y, paired=T)
> snb1.fit
$counts.y
[1] 4 6
$coeffs
  [,1] [,2]
a  110  462
b   -8 -120
d  221  105

while with encrypted data we obtain

> snb2.fit <- SNB.fit(cX, cy, paired=T, pk=keys$pk)
> snb2.fit
$counts.y
Vector of 2 Fan and Vercauteren cipher texts
$coeffs
Matrix of 3 x 2 Fan and Vercauteren cipher texts

We can then confirm that encrypted and decrypted results match:

> dec(keys$sk, snb2.fit$counts.y) == snb1.fit$counts.y
[1] TRUE TRUE
> dec(keys$sk, snb2.fit$coeffs) == snb1.fit$coeffs
  [,1] [,2]
a TRUE TRUE
b TRUE TRUE
d TRUE TRUE


ii) Prediction 

The function SNB.fit generates an object of class SNB, which can be directly fed
into the base R predict function. Apart from the fitted model and the testing data
newX, there are only two parameters: type controls the type of prediction to be returned
(raw coefficients or conditional class probabilities); sk is the secret key, required if
predictions are to be decrypted.

Continuing the example above, we can obtain the probabilities of class y = 1,

> snb2.predProb <- predict(model=snb2.fit, newX=cX, type="prob", sk=keys$sk)
> snb2.predProb

[1] 0.2212818 0.1973391 0.7221404 0.8975560 0.7293452 0.4353264 0.4621838 
[8] 0.4442447 0.7364313 0.8907048 


or obtain the class counts and coefficients $\{e_{ij}, d_j\}$ (see §4.2) for all testing observations,

> snb2.predRaw <- predict(model=snb2.fit, newX=cX, type="raw", sk=keys$sk)
> snb2.predRaw
$counts.y
[1] 4 6
$coeffs.e
      [,1] [,2]
 [1,]  102 -138
 [2,]   70 -138
 [3,]   86  102
 [4,]  102  222
 [5,]   94  102
 [6,]   70  -18
 [7,]   94  -18
 [8,]   78  -18
 [9,]  102  102
[10,]   86  222
$coeffs.d
[1] 221 105

E Details of data sets





Table 2: Dimensions of the data sets used (shown in the same order as in the corresponding figure). All
data sets were extracted from the UCI Machine Learning Repository (Lichman, 2013).

Data set                   #obs1   #obs2   #vars1   #vars2   #vars3
acute inflammation           120     120        6       11       15
acute nephritis              120     120        6       11       15
adult income               48842   45222       14      105      129
bank marketing             41188   30488       20       57       97
blood transfusion            748     748        3        3       15
breast cancer wisc diag      569     569       30       30      150
breast cancer wisc orig      699     683        9        9       45
breast cancer wisc prog      198     194       32       32      160
chess krvkp                 3196    3196       36       73       73
haberman survival            306     306        3       14       22
heart disease cleveland      303     297       13       28       48
ionosphere                   351     351       32       32      160
magic telescope            19020   19020       10       10       50
mammographic masses          961     830        4       14       18
monks3                       554     554        6       17       17
musk1                        476     476      166      166      830
musk2                       6598    6598      166      166      830
ozone 1hr                   2536    1848       72       72      360
spambase                    4601    4601       57       57      285

#obs1: number of observations in the original dataset; #obs2: number of observations after removing missing data; #vars1:
number of predictors in the original dataset; #vars2: number of predictors after transforming factors into sets of binaries
(which increases the number of predictors) and continuous predictors into quintile-discretised predictors (which does not
increase the number of predictors); #vars3: number of predictors after also transforming quintile predictors into sets of
binaries.

F Completely random forest parameter performance


Figure 7: Performance of various methods (AUC on an axis from 0.0 to 1.0). For each model
and dataset, the AUC for 100 stratified randomisations of the training and testing sets;
the horizontal lines represent the frequency of class y = 1. M = 8 throughout.




























































